Interview with Nizar Habash, linguist and computer engineer. Nizar will be a speaker at the Autumn School of the IMA Language Centre dedicated to the didactics of Arabic. The Autumn School will be held from 21 to 25 October 2024.
While human languages are, at their core, structured and systematic, much like programming languages, they introduce far more ambiguity, idiosyncrasy, richness, and subtlety. These challenges affect both human-computer and human-human communication. Computational linguistics (also known as natural language processing), which sits at the intersection of linguistics and computer science, focuses on developing algorithms and models that enable machines to process and generate human language in support of human-computer and human-human communication, from speech recognition to machine translation and text generation. In the context of language learning and teaching, computational linguistics can help develop not only systems, but also insights that support educators and learners.
Readability refers to how easily a reader can understand a piece of written text. Factors such as sentence length, vocabulary complexity, and text structure all play a role. Readability assessment is the systematic process of determining the suitability of a text for a specific audience or reading level. For example, readability assessment tools can help ensure that educational materials match students' comprehension abilities, or that public documents are accessible to a wide audience. Readability assessment systems can be used to alert human editors or teachers to readability issues. Together with text-generation paraphrasing models, readability assessment models allow us to control the level of text rewriting, i.e. to provide simpler (or even more complex) vocabulary and structure as needed.
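To make the factors mentioned above a little more concrete, here is a minimal Python sketch that computes two crude surface indicators: average sentence length (in words) and average word length (in characters). It is only an illustration under simplifying assumptions, not the method used in any of the projects discussed here. The function name `surface_features` is made up for this example, and real readability models rely on much richer linguistic information.

```python
import re


def surface_features(text: str) -> dict:
    """Compute two crude surface indicators of reading difficulty."""
    # Split on sentence-final punctuation, including the Arabic question mark.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    # \w+ matches runs of Unicode letters and digits.
    # Assumption: the text is undiacritized; Arabic diacritics are not handled here.
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "avg_word_length": 0.0}
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


features = surface_features("The cat sat. The dog ran away quickly.")
print(features)
```

Longer sentences and longer words tend to push both numbers up, which is why such counts are often used as a first, rough signal before any deeper analysis.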
I will present two specific projects on Arabic readability, SAMER and BAREC, and place them in the context of the larger goals of my lab, the Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi. CAMeL Lab focuses on developing state-of-the-art open-source tools and data sets to support Arabic natural language processing: http://www.camel-lab.com/.
SAMER (Simplification of Arabic Masterpieces for Extensive Reading) was a project co-led with Prof. Muhamed Al Khalil and funded by a New York University Abu Dhabi (NYUAD) Research Enhancement Fund. The main objective of SAMER was to create standards and tools for the simplification of modern fiction in Arabic for school-age learners. The project contributions include: (a) designing a five-level prototypical readability scale, (b) developing a 36K-word Readability-leveled Thesaurus for Arabic, (c) creating a simplification interface platform as an extension to Google Docs, and (d) constructing a 160K-word three-level parallel graded corpus, the first of its kind, that maps text from Arabic fictional masterpieces to easier readability levels. All these resources are publicly available: http://samer.camel-lab.com/.
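As a rough illustration of how a word-level leveled lexicon could be used to alert an editor, the Python sketch below flags words whose level is above a target level. Everything in it is hypothetical: the tiny English lexicon `TOY_LEVELS`, the function `flag_hard_words`, and the choice to treat unknown words as hardest are assumptions made for this example. It does not reproduce the SAMER thesaurus or interface; it only borrows the idea of a five-level scale.

```python
import re

# Hypothetical toy lexicon: word -> readability level (1 = easiest, 5 = hardest).
TOY_LEVELS = {"the": 1, "cat": 1, "sat": 1, "on": 1, "ottoman": 4}


def flag_hard_words(text, target_level, levels=TOY_LEVELS, unknown_level=5):
    """Return the words whose level exceeds target_level, in order of appearance."""
    flagged = []
    for word in re.findall(r"\w+", text.lower()):
        # Assumption: words missing from the lexicon are treated as hardest.
        level = levels.get(word, unknown_level)
        if level > target_level and word not in flagged:
            flagged.append(word)
    return flagged


print(flag_hard_words("The cat sat on the ottoman", target_level=2))
```

An editor targeting level 2 would see "ottoman" flagged and could then look for an easier alternative.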
BAREC (Balanced Arabic Readability Evaluation Corpus) is an ongoing project, co-led with Prof. Hanada Taha (Zayed University) and funded by the Abu Dhabi Arabic Language Centre. In contrast to SAMER, which focused on modeling readability at the word level, and the Arabi21/Taha effort, which focused on the book level, BAREC focuses on sentence-level readability assessment on a 19-level scale inspired by Arabi21/Taha. BAREC goals include (a) the curation of a 10-million-word corpus that encompasses diverse genres, topics, and countries of origin, with a particular focus on readability levels, (b) the manual annotation of a 1-million-word subset for readability levels, and (c) the development of artificial intelligence (AI) tools to assist content creators in assessing the readability levels of their materials based on specific target audiences. All these resources will be publicly available: http://barec.camel-lab.com/.
Arabic is an excellent language for computational processing, mainly because it combines many challenges and because modeling it successfully has consequences for over 400 million speakers. These challenges include orthographic ambiguity due to elided diacritics; morphological richness, which includes templatic and concatenative processes and numerous features leading to a very large number of forms per lexical entry; dialectal variation across space (geographical dialects) and time (historical forms that are still in use); and the high degree of variability and noise in Arabic as used on a daily basis, including code-switching with other languages, frequent spelling variants, and even the use of scripts other than Arabic. None of these issues is unique to Arabic, but their coexistence makes processing Arabic more complex and more interesting, with possible benefits for other languages, too.
For anyone attending the Autumn School, I would say: come with curiosity and openness to learn, and share your experience with others working on the teaching and learning of Arabic. I am excited to share insights and ideas from my work in Arabic computational linguistics, and I look forward to learning more from others and their perspectives.