Multilingual speakers will often mix languages when they communicate with other multilingual speakers in what is usually known as code-switching (CSW). CSW is typically present on the intersentential, intrasentential and even morphological levels. CSW presents serious challenges for language technologies such as Machine Translation (MT), Automatic Speech Recognition (ASR), language generation (LG), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when there is input mixed in from another. Recent work has shown that even powerful multilingual models, such as multilingual BERT, yield subpar performance on CSW data (cf. Aguilar and Solorio, 2020).
Considering the ubiquitous nature of CSW in informal text communication such as newsgroups, tweets, blogs, and other social media, and the number of multilingual speakers worldwide that use these platforms, addressing the challenge of processing CSW data continues to be of great practical value. This workshop aims to bring together researchers interested in technology for mixed language data, in either spoken or written form, and increase community awareness of the different efforts developed to date in this space.
Topics of Interest
The workshop will invite contributions from researchers working in NLP and speech approaches for the analysis and processing of mixed-language data. Topics of relevance to the workshop will include the following:
- Development of linguistic resources to support research on code-switched data;
- NLP approaches for language identification, named entity recognition, sentiment analysis, machine translation, or language generation in code-switched data;
- NLP techniques for the syntactic analysis of code-switched data;
- Domain/dialect/genre adaptation techniques applied to code-switched data processing;
- Language modeling approaches to code-switched data processing;
- Crowdsourcing approaches for the annotation of code-switched data;
- Position papers discussing the challenges that code-switched data poses for NLP techniques;
- Methods for improving ASR on code-switched data;
- Survey papers of NLP research for code-switched data;
- Sociolinguistic and/or sociopragmatic aspects of code-switching.
Important Dates
- Workshop submission deadline (long, short and special track): March 29th
- Notification of acceptance: April 19th (previously April 15th)
- Camera ready papers due: April 26th
- Workshop date: June 11th
All deadlines are 11:59 pm UTC-12 (“anywhere on Earth”).
In the past few years we have organized a series of shared tasks focusing primarily on enabling technology for code-switching, including language identification, part-of-speech tagging and named entity recognition. This year we are organizing a series of shared tasks on machine translation for code-switching settings in multiple language combinations and directions.
In this task we provide gold-standard data to train and evaluate MT models that take English as input and generate Hinglish (code-switched Hindi-English) output.
We provide raw data with no gold-standard translations. Participants are challenged to build systems that generate high-quality translations for the pairs shown below. More language directions may be added soon:
- Spanish-English → English
- Spanish-English → Spanish
- English → Spanish-English
- English → Spanish
- Modern Standard Arabic-Egyptian Arabic → English
- Modern Standard Arabic-Egyptian Arabic → Spanish
- [Spanish-English → English]: I’m expecting dos camonietas llenas de rosas This weekend. → I’m expecting two trucks full of roses This weekend.
- [Spanish-English → Spanish]: Es viernes y el outfit lo sabe → Es viernes y el atuendo lo sabe
- [English → Spanish]: My goal is to move to my own apartment next year → Mi objetivo es mudarme a mi propio apartamento el próximo año
- [Spanish → English]: A mi manera o pa la calle!! → My way or the highway!!
We will use the Linguistic Code-Switching Evaluation (LinCE) benchmark. The leaderboard will rank systems based on BLEU scores. We also plan to run a smaller human evaluation, which will be presented at the workshop.
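For participants who want a rough local check before submitting, the following is a minimal sketch of corpus-level BLEU. It assumes whitespace tokenization, a single reference per sentence, uniform weights over 1- to 4-grams, and no smoothing. The function names are illustrative, and the benchmark's own tokenization and BLEU settings are not specified here, so leaderboard scores may differ from what this sketch produces.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions (not taken from the benchmark): whitespace tokenization,
    uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # many times as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["Es viernes y el atuendo lo sabe"]
    untranslated = ["Es viernes y el outfit lo sabe"]
    print(corpus_bleu(reference, reference))      # identical output scores 100.0
    print(corpus_bleu(untranslated, reference))   # lower: "outfit" was left untranslated
```

Because the statistics are pooled over the whole test set before the precisions are computed, a single sentence with no 4-gram match does not zero out the corpus score, which is one reason corpus-level BLEU is normally preferred over averaging sentence-level scores.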
The datasets are available from the Linguistic Code-Switching Evaluation Benchmark (LinCE).
[Update (03/29/2021)]: Usernames have been removed from the datasets. Please download the newest version of the datasets from LinCE.
- Shared Task training data release: Feb 26th
- Shared Task test phase: April 19th - 25th (previously April 1st - 7th)
- Shared Task System description papers due: April 30th (previously April 15th)
- Shared Task reviews back to authors: May 8th (previously April 22nd)
- Shared Task Camera ready papers due: May 15th
Questions about the shared task can be sent to: firstname.lastname@example.org
- Authors are invited to submit papers describing original, unpublished work in the topic areas listed above. Long papers can contain up to eight pages of content, while short papers can contain up to four pages of content; both allow an unlimited number of pages for references.
- All submissions must be in PDF format and must comply with the official NAACL 2021 style guidelines: https://2021.naacl.org/calls/papers/#submission-types--requirements
The reviewing process will not be blind and papers can include the authors’ names and affiliations. Each submission will be reviewed by at least three members of the program committee. Accepted papers will be published in the workshop proceedings.
Papers that have been or will be submitted to other meetings or publications are acceptable, but authors must indicate this information at submission time. If accepted, authors must notify the organizers before the camera-ready deadline as to whether the paper will be presented at the workshop or elsewhere.
Papers should be submitted electronically at https://www.softconf.com/naacl2021/calcs2021
We also invite non-archival one-page abstracts of recently published work that highlight CSW research by young researchers or early-career investigators. The goal is to help increase the visibility of PhD students, postdocs and early-career investigators (loosely defined) working in the space of language technology for CSW. Please note that you should use the anonymized template for submission and that you can use an unlimited number of pages for references.