Arabic Text Preprocessing for the Natural Language Processing Applications

Arafat Awajan

Author(s): Arafat Awajan

Article publication date: 2007-12-01

Vol. 25 No. 4 (yearly), pp. 179-189.

DOI:

467

Keywords

Arabic Text Preprocessing, Stemming, Morphological Analysis, Text Annotation, Part-of-Speech Tagging

Abstract

A new approach for preprocessing vowelized and unvowelized Arabic texts in order to prepare them for Natural Language Processing (NLP) purposes is described. The developed approach is rule-based and made up of four phases: text tokenization, light stemming of words, morphological analysis of words, and text annotation. The first phase preprocesses the input text in order to isolate the words and represent them in a formal way. The second phase applies a light stemmer in order to extract the stem of each word by eliminating its prefixes and suffixes. The third phase is a rule-based morphological analyzer that determines the root and the morphological pattern of each extracted stem. The last phase produces an annotated text in which each word is tagged with its morphological attributes. The preprocessor presented in this paper is capable of dealing with both vowelized and unvowelized words, and provides the input words along with the relevant linguistic information needed by different applications. It is designed to be used with different NLP applications such as machine translation, text summarization, text correction, information retrieval, and automatic vowelization of Arabic text.
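The four phases named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the affix lists, the pattern list, and the function names (`tokenize`, `light_stem`, `analyze`, `annotate`) are assumptions chosen for illustration, and the paper's actual rule set is not reproduced here.

```python
import re

# Arabic diacritics (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# A token is a run of Arabic letters, optionally carrying diacritics.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")

# Illustrative affix lists only (assumption, not the paper's rules).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

# Illustrative patterns: the letters ف, ع, ل mark the three root slots;
# any other letter in a pattern must match the stem literally.
PATTERNS = ["فاعل", "مفعول", "فعل"]


def tokenize(text):
    """Phase 1: isolate Arabic words from the input text."""
    return ARABIC_WORD.findall(text)


def light_stem(word):
    """Phase 2: drop diacritics, then at most one prefix and one suffix."""
    stem = DIACRITICS.sub("", word)
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= 3:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[:-len(suffix)]
            break
    return stem


def analyze(stem):
    """Phase 3: return (root, pattern) for the first matching pattern."""
    for pattern in PATTERNS:
        if len(pattern) != len(stem):
            continue
        root = []
        for p_ch, s_ch in zip(pattern, stem):
            if p_ch in "فعل":
                root.append(s_ch)   # root slot: keep the stem letter
            elif p_ch != s_ch:
                break               # literal pattern letter does not match
        else:
            return "".join(root), pattern
    return None, None


def annotate(text):
    """Phase 4: tag each word with its stem, root, and pattern."""
    annotated = []
    for word in tokenize(text):
        stem = light_stem(word)
        root, pattern = analyze(stem)
        annotated.append(
            {"word": word, "stem": stem, "root": root, "pattern": pattern}
        )
    return annotated
```

Because diacritics are removed before affix stripping, the same code path handles vowelized and unvowelized input in this sketch; a fuller analyzer would presumably use the vowels, when present, to disambiguate patterns rather than discard them.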
