Article information

  • Title: IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process
  • Authors: Ahmad Al Badawi; Qasem Abu Al-Haija
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Electronic ISSN: 2156-5570
  • Publication year: 2013
  • Volume: 4
  • Issue: 2
  • DOI: 10.14569/IJACSA.2013.040212
  • Publisher: Science and Information Society (SAI)
  • Abstract: Tokenization is the task of chopping a character sequence into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A new software tool and algorithm to support the information retrieval system (IRS) during the tokenization process are presented. The proposed tool filters out the following computer character sequences: IP addresses, web URLs, dates, and email addresses. The tool uses pattern-matching algorithms and filtration methods. After this process, the IRS can run tokenization on the resulting text, which will be free of these sequences.
  • Keywords: thesai; IJACSA; thesai.org; journal; IJACSA papers; Information Retrieval; Tokenization; pattern matching; Sequences Filtration
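
The filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the record does not give the paper's actual patterns or algorithm, so the regular expressions, the `SEQUENCES` matcher, and the `filter_sequences` function are all hypothetical names and simplified assumptions (for example, the IP pattern does not check that each octet is at most 255).

```python
import re

# Illustrative patterns only (assumptions, not taken from the paper).
# Order matters: alternatives are tried left to right at each position.
PATTERNS = [
    ("URL", r"\b(?:https?://|www\.)\S+"),
    ("EMAIL", r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    ("IP", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    ("DATE", r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b"),
]

# One combined regex with a named group per sequence type.
SEQUENCES = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def filter_sequences(text):
    """Remove matched sequences from text.

    Returns the cleaned text and a list of (type, matched string) pairs.
    """
    found = []

    def _collect(match):
        # lastgroup is the name of the alternative that matched.
        found.append((match.lastgroup, match.group()))
        return " "

    cleaned = SEQUENCES.sub(_collect, text)
    # Collapse the whitespace left behind by the removed sequences.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, found


sample = (
    "Contact admin@example.com or visit http://example.com/docs "
    "from 192.168.0.1 on 2013-02-15 please."
)
cleaned, found = filter_sequences(sample)
tokens = cleaned.split()  # stand-in for the IRS tokenizer run afterwards
print(found)
print(tokens)
```

Collecting the removed sequences alongside the cleaned text is a design choice made here so that an indexer could still store them as whole units; whether the paper's tool does this is not stated in the abstract.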