Computational Approaches to Linguistic Code-Switching, CALCS 2025
Albuquerque, New Mexico, USA

First Call for Papers


This will be the seventh edition of the workshop, co-located with NAACL 2025.

Bilingual and multilingual speakers often engage in code-switching (CS), mixing languages within a conversation, influenced by cultural nuances. CS can occur at inter-sentential, intra-sentential, and morphological levels, posing challenges for language understanding and generation. Models trained for a single language often struggle with mixed-language input. Despite advances in multilingual pre-trained language models (LMs), they may still perform poorly on CS data. Research on LMs' ability to process CS data, considering cultural nuances, reasoning, coverage, and performance biases, remains underexplored.

As CS becomes more common in informal communication like newsgroups, tweets, and social media, research on LMs processing mixed-language data is urgently needed. This workshop aims to unite researchers working on spoken and written CS technologies, promoting collaboration to improve AI's handling of CS across diverse linguistic contexts.

Topics of Interest


The workshop invites contributions from researchers working on NLP and speech approaches for the analysis and processing of mixed-language data. Topics of relevance to the workshop include the following:

  1. Development of data and model resources to support research on CS data
  2. New data augmentation techniques for improving robustness on CS data
  3. New approaches for NLP downstream tasks: question answering, conversational agents, named entity recognition, sentiment analysis, machine translation, language generation, and ASR in CS data
  4. NLP techniques for the syntactic analysis of CS data
  5. Domain, dialect, genre adaptation techniques applied to CS data processing
  6. Language modeling approaches to CS data processing
  7. Sociolinguistic and/or sociopragmatic aspects of CS
  8. Techniques and metrics for automatically evaluating synthetically generated CS text
  9. Utilization of LLMs and assessment of their performance on NLP tasks for CS data
  10. Survey and position papers discussing the challenges of CS data to NLP techniques
  11. Ethical issues and considerations in CS applications

Submissions


The workshop accepts three categories of papers: regular workshop papers, non-archival papers, and cross-submissions. Only regular workshop papers will be included in the proceedings as archival publications, and only regular workshop papers are eligible for the best paper award. All three categories may be long (maximum 8 pages) or short (maximum 4 pages), with unlimited additional pages for references, following the ARR formatting requirements. The limitations section is optional and does not count toward the page limit. The reported research should be substantially original.

Reviewing will be double-blind: no author information should be included in the papers, and self-references that identify the authors should be avoided or anonymized. Accepted papers will be presented as posters or oral presentations, and accepted regular workshop papers will appear in the workshop proceedings. We also welcome papers of up to 2 pages as non-archival submissions; please send us an email if you are making a non-archival submission. The submission portal is open on OpenReview.

Shared Task on Automatic Evaluation for Code-Switched Text Generation


This shared task focuses on developing automatic evaluation metrics for code-switched (CS) text generation. Participants are tasked with creating systems that can accurately assess the quality of synthetically generated CS text, considering both fluency and accuracy. This is crucial because:

  • Scarcity of CS Data: CS text data is limited, making automatic generation vital for data augmentation and improving model performance.
  • Growing Demand: The need for CS text is increasing, particularly in dialogue systems and chatbots, to enable more natural and inclusive interactions.
  • Lack of Robust Evaluation: Current methods for evaluating CS text are insufficient, hindering progress in this field.

This shared task aims to address this gap and drive further research in automatic evaluation metrics for CS text generation.

Languages Supported:

  • Public Leaderboard: English-Hindi, English-Tamil, English-Malayalam
  • Private Leaderboard: English-Indonesian, Indonesian-Javanese, Singlish (English-Chinese)

Metric:

Accuracy: Systems will be evaluated based on their accuracy in predicting human preferences for CS text. This will be measured by comparing the system's ranking of generated sentences (Sent 1 vs. Sent 2) with human annotations in the CSPref dataset.
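As a rough illustration, the sketch below computes this accuracy from paired system predictions and human labels. The label values ("Sent 1", "Sent 2", "Tie") follow the CSPref description in this call; the function and variable names are ours, not part of the official evaluation code.

```python
# Minimal sketch of the accuracy metric described above.
# Assumes predictions and gold labels both use the values
# "Sent 1", "Sent 2", or "Tie"; names here are illustrative only.

def preference_accuracy(predictions, gold_labels):
    """Fraction of instances where the predicted preference matches the human annotation."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Example: three instances, two predicted correctly -> 0.667
print(preference_accuracy(["Sent 1", "Tie", "Sent 2"],
                          ["Sent 1", "Sent 2", "Sent 2"]))
```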

Dataset:

The CSPref dataset will be used for this task. It contains:

  • Original L1: English sentences
  • Original L2: Hindi, Tamil, or Malayalam sentences
  • Sent 1, Sent 2: Two different CS generations based on the original sentences.
  • Chosen: Human annotation indicating the preferred sentence (Sent 1, Sent 2, or Tie).
  • Lang: Language pair

Data is available here: https://huggingface.co/datasets/garrykuwanto/cspref
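For convenience, a minimal loading sketch using the Hugging Face `datasets` library is shown below. The split name and column names are assumptions based on the field list above and may differ from the actual dataset; please verify them against the dataset card at the link above.

```python
# Minimal sketch: load the CSPref dataset from the Hugging Face Hub.
# The split name ("train") and the column layout are assumptions; check
# the dataset card at https://huggingface.co/datasets/garrykuwanto/cspref.
from datasets import load_dataset

cspref = load_dataset("garrykuwanto/cspref")
print(cspref)  # shows the available splits and columns

example = cspref["train"][0]  # assumes a "train" split exists
print(example)  # expected fields: original sentences, Sent 1, Sent 2, Chosen, Lang
```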

Evaluation:

  • Systems will be ranked on a public leaderboard based on their accuracy in predicting human preferences on the English-Hindi, English-Tamil, and English-Malayalam language pairs.
  • A private leaderboard will evaluate system performance on unseen language pairs (English-Indonesian, Indonesian-Javanese, Singlish) to assess generalization ability.

Submission:

  • Participants will submit their system's predictions for each instance in the test set, indicating their preferred sentence (Sent 1, Sent 2, or Tie).
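The exact submission file format is defined on the competition page; the snippet below is only a hypothetical illustration of writing one preference prediction per test instance. The CSV layout and the column names ("id", "prediction") are our assumptions, not the official format.

```python
# Hypothetical illustration only: write one preference prediction per test
# instance to a CSV file. The actual required format is specified on the
# EvalAI competition page; the column names here are assumptions.
import csv

predictions = [
    {"id": 0, "prediction": "Sent 1"},
    {"id": 1, "prediction": "Tie"},
]

with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "prediction"])
    writer.writeheader()
    writer.writerows(predictions)
```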

Goal:

The goal of this shared task is to encourage the development of robust and reliable automatic evaluation metrics for CS text generation, ultimately leading to more fluent and accurate CS language models.

Competition Page:

For more information about the competition, please visit the competition page: https://eval.ai/web/challenges/challenge-page/2437/overview

Important Dates


Paper Submission

  • Workshop submission deadline (regular and non-archival submissions): 21 February 2025
  • Notification of acceptance: 8 March 2025
  • Camera ready papers due: 17 March 2025
  • Workshop date: 3 or 4 May 2025

Shared Task Submission

  • Training data and evaluation platform release: 23 January 2025
  • Test release: 14 February 2025
  • Results submission: 21 February 2025
  • Paper submission: 28 February 2025
  • Notification deadline: 8 March 2025

All deadlines are 11:59 pm UTC-12 (“anywhere on Earth”).

Program Committee


Ana Valeria González    Novo Nordisk
Astik Biswas    Oracle
Alexander Gelbukh    ICM
Barbara Bullock    UT Austin
Constantine Lignos    Brandeis University
Costin-Gabriel Chiru    Ghent University
Els Lefever    Ghent University
Emre Yılmaz    University of Houston-Downtown
Emily Ahn    University of Washington
Grandee Lee    SUSS
Helena Gómez Adorno    UAM
Jacqueline Toribio    UT Austin
Julia Hirschberg    Columbia University
Khyathi Chandu    Allen AI
Parth Patwa    AWS
Pastor López Monroy    Centro de Investigaciones Matemáticas Avanzadas
Manuel Mager    Amazon
Samson Tan    Amazon AWS
Segun Taofeek Aroyehun    University of Konstanz
Steven Abney    University of Michigan
Yang Liu    Amazon
Yerbolat Kassanov    ByteDance

Organizers


Genta Indra Winata,
Senior Applied Scientist, Capital One AI Foundations, USA
Sudipta Kar,
Senior Applied Scientist, Amazon Alexa AI, USA
Marina Zhukova,
Ph.D. Candidate, Department of Linguistics, University of California, Santa Barbara, USA
Injy Hamed,
Postdoctoral Associate, MBZUAI, UAE
Garry Kuwanto,
Ph.D. Student, Boston University, USA
Mahardika Krisna Ihsani,
M.Sc. Candidate, MBZUAI, UAE
Barid Xi Ai,
Postdoctoral Researcher, NUS, Singapore
Derry Tanti Wijaya,
Associate Professor, Monash University Indonesia, Indonesia
Adjunct Faculty, Boston University, USA
Thamar Solorio,
Professor, MBZUAI, UAE
Professor, University of Houston, USA