Arabic Language Disambiguation for Natural Language Processing Applications
Technology #cu14012
Processing Arabic text into its elementary linguistic components is particularly challenging for two reasons: Arabic morphology (how words are put together) is complex, and Arabic orthography (how the language is written) is highly ambiguous, since Arabic is typically spelled without short vowels and other diacritical marks. Additionally, Arabic has many dialects that differ from Standard Arabic (the formal language used in education and mainstream print media) in morphology and lexicon, and that have no standard writing systems.
MADAMIRA is a software suite for morphological analysis and disambiguation of Arabic and its dialects. It performs several types of linguistic analysis on raw Arabic text, as required for natural language processing (NLP). The toolkit uses machine-learning algorithms and established Arabic morphological analyzers to report the linguistic features of each word in context. Downstream NLP products and tools can then build on the analyses that MADAMIRA produces.
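The orthographic ambiguity described above can be made concrete with a short Python sketch. It is not part of MADAMIRA and does not use its API; the helper name `strip_diacritics` and the three example words are chosen here for illustration. The sketch removes the Arabic diacritical marks (Unicode U+064B through U+0652) from three different words and shows that they collapse to the same written form, which is the form a reader or an NLP system normally sees.

```python
# Arabic short-vowel and related marks: fathatan (U+064B) through sukun (U+0652).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word: str) -> str:
    """Return the word as it is usually written, without diacritical marks."""
    return "".join(ch for ch in word if ch not in ARABIC_DIACRITICS)


# Three distinct fully diacritized words (illustrative, not MADAMIRA output).
# Written with escapes so the code points are unambiguous.
diacritized_words = {
    "\u0643\u064E\u062A\u064E\u0628\u064E": "kataba, verb: 'he wrote'",
    "\u0643\u064F\u062A\u0650\u0628\u064E": "kutiba, verb: 'it was written'",
    "\u0643\u064F\u062A\u064F\u0628": "kutub, noun: 'books'",
}

# Collect the undiacritized spellings; a set keeps only the distinct ones.
bare_forms = {strip_diacritics(word) for word in diacritized_words}

print(len(diacritized_words), "diacritized words ->", len(bare_forms), "written form")
```

All three words share the bare spelling kaf-ta-ba, so the written text alone does not say which reading is intended. Choosing among such readings from the surrounding context is the disambiguation task.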
The MADAMIRA software suite extracts and clarifies the linguistic information needed to support accurate Arabic natural language processing
MADAMIRA provides linguistic information such as tokenization, diacritization, lemmatization, part-of-speech tagging, full morphological tagging, base phrase chunking and named entity recognition for each Arabic word received as input. MADAMIRA users can then use this information to create the analysis best suited for their application. The high accuracy of MADAMIRA in correctly predicting linguistic features of Arabic words has been demonstrated at the Center for Computational Learning Systems at Columbia University.
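As a rough illustration of how per-word analyses like these might be represented and selected, the following Python sketch defines a hypothetical `WordAnalysis` record and picks, among candidate analyses, the one that agrees with the most predicted feature values. Everything here is an assumption made for illustration: the class, field names, functions, and romanized example values are not MADAMIRA's API or output format, and the agreement count stands in for whatever ranking the toolkit actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordAnalysis:
    """Hypothetical per-word record; field names are illustrative only."""
    surface: str      # word as written in the input (romanized placeholder)
    diacritized: str  # fully vowelled form
    lemma: str        # dictionary form
    pos: str          # part-of-speech tag
    tokens: tuple     # segmented form of the word


def agreement(analysis: WordAnalysis, predicted: dict) -> int:
    """Count how many predicted feature values this candidate matches."""
    return sum(
        1 for name, value in predicted.items()
        if getattr(analysis, name) == value
    )


def disambiguate(candidates: list, predicted: dict) -> WordAnalysis:
    """Return the candidate that agrees with the most predicted features."""
    return max(candidates, key=lambda a: agreement(a, predicted))


# Candidate analyses a morphological analyzer might propose (made-up values).
candidates = [
    WordAnalysis("ktb", "kataba", "katab", "verb", ("kataba",)),
    WordAnalysis("ktb", "kutub", "kitAb", "noun", ("kutub",)),
]

# Feature values a context-sensitive classifier might predict (hard-coded here).
predicted = {"pos": "noun", "lemma": "kitAb"}

best = disambiguate(candidates, predicted)
print(best.diacritized, best.pos)  # prints "kutub noun"
```

A downstream application would then read only the fields it needs from the selected record, for example `best.lemma` for search indexing or `best.tokens` as input to a translation system.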
Lead Inventors (alphabetical order):
Applications:
- Part-of-Speech tagging
- Morphological disambiguation for full range of morphological features
- Named entity recognition
- Base phrase chunking
Advantages:
- Single software package capable of performing several natural language processing tasks
- Unbiased natural language processing for Arabic, providing flexibility for developers building applications requiring NLP
- High accuracy in predicting linguistic features such as part-of-speech, lemmas, diacritics, and tokenization
Available for licensing and sponsored research support
Tech Ventures Reference: IR CU14012
Selected Related Publications:
- N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological Analysis and Disambiguation for Dialectal Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Atlanta, Georgia, 2013.
- M. Diab. Second Generation Tools (AMIRA 2.0): Fast and Robust Tokenization, POS Tagging, and Base Phrase Chunking. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, April 2009.
- N. Habash, O. Rambow, and R. Roth. MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, 2009.
- R. Roth, O. Rambow, N. Habash, M. Diab, and C. Rudin. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of the Association for Computational Linguistics (ACL), Columbus, Ohio, 2008.
- M. Diab, K. Hacioglu, and D. Jurafsky. Automated Methods for Processing Arabic Text: From Tokenization to Base Phrase Chunking. In Arabic Computational Morphology: Knowledge-based and Empirical Methods, edited by Antal van den Bosch and Abdelhadi Soudi. Kluwer/Springer Publications, 2007.
- N. Habash and O. Rambow. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2005.