Third Workshop on Computational Approaches to Linguistic Code-switching

The 3rd Workshop will be co-located with ACL 2018 on July 19 in Melbourne, Australia.

Workshop Dates

  • Apr 30: Long and short paper submission deadline
  • May 14: Acceptance notification
  • May 21: Camera ready
  • Jul 19: Workshop

Shared Task Dates

  • Feb 8: Training data release
  • Mar 23: Test phase starts
  • Apr 6: Test phase ends
  • May 4: System description paper
  • May 14: Author feedback
  • May 21: Camera ready


Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential level (switching between sentences), the intrasentential level (mixing words from multiple languages in the same utterance), and even the morphological level (mixing of morphemes). CS presents serious challenges for language technologies such as parsing, machine translation (MT), automatic speech recognition (ASR), information retrieval (IR), information extraction (IE), and semantic processing. Traditional techniques trained on one language quickly break down when input from another language is mixed in. Even for problems considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed language present.

This workshop aims to bring together researchers interested in solving this problem and to increase community awareness of viable approaches for reducing the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop include the following:

  • Development of linguistic resources to support research on code-switched data
  • NLP approaches for language identification in code-switched data
  • NLP approaches for named entity recognition in code-switched data
  • NLP techniques for the syntactic analysis of code-switched data
  • Domain/dialect/genre adaptation techniques applied to code-switched data processing
  • Language modeling approaches to code-switched data processing
  • Crowdsourcing approaches for the annotation of code-switched data
  • Machine translation approaches for code-switched data
  • Position papers discussing the challenges of code-switched data to NLP techniques
  • Methods for improving ASR on code-switched data
  • Survey papers of NLP research for code-switched data
  • Sociolinguistic aspects of code-switching
  • Sociopragmatic aspects of code-switching

Invited Speakers

Note: more speakers to be added.
Pascale Fung    Hong Kong University of Science & Technology

Shared Task: Named Entity Recognition on Code-switched Data

On this occasion, we are organizing a Named Entity Recognition (NER) shared task on CS data with the purpose of providing even more resources to the community. The goal is to allow participants to explore supervised, semi-supervised, and/or unsupervised approaches to predicting the entity types in CS data. We will release gold-standard data for tuning and testing systems in the following language pairs: Spanish-English (SPA-ENG) and Modern Standard Arabic-Egyptian (MSA-EGY). We will use Twitter data for both language pairs.

Participants in this shared task will be required to submit the output of their systems within a pre-specified time window in order to qualify for evaluation. They will also be required to submit a paper describing their system.

Entity Types

  • Person 
  • Location 
  • Organization 
  • Group 
  • Title 
  • Product 
  • Event 
  • Time 
  • Other 

Fill out the registration form to participate in the shared task.

Updates will be given through the workshop Google group and the Twitter account @WCALCS. Direct updates will be sent by email to participants based on the information provided in the registration form.

Data Release

We will be sending the ENG-SPA data directly to registered participants because some of the development tweets can no longer be fetched. If you have already registered for the shared task but have not received the data, please don't hesitate to send us an email.

For both MSA-EGY and ENG-SPA tweets, we provide packages that retrieve, tokenize, and synchronize the NE types of the training and development data: MSA-EGY Package and ENG-SPA Package. Instructions on how to use the packages are included. Additionally, the data has been tagged using the IOB scheme with the entity types listed above.
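To illustrate what IOB-tagged output looks like, the following sketch groups B-/I- tagged tokens into entity spans. The example sentence and the short tag names (PER, LOC) are invented for illustration; the shared task data and packages define the actual tag strings.

```python
# Illustrative sketch: extracting entity spans from IOB-tagged tokens.
# The example sentence and tag names are invented, not from the task data.

def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)    # continue the current entity
        else:                               # "O" tag or inconsistent I- tag
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                      # flush a trailing entity
        entities.append((" ".join(current_tokens), current_type))
    return entities

# A code-switched (Spanish-English) toy sentence:
tokens = ["Fui", "a", "ver", "a", "Shakira", "en", "Houston", "yesterday"]
tags   = ["O",   "O", "O",   "O", "B-PER",   "O",  "B-LOC",   "O"]
print(extract_entities(tokens, tags))  # [('Shakira', 'PER'), ('Houston', 'LOC')]
```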

For English-Spanish data (guidelines here):

For Modern Standard Arabic-Egyptian data (guidelines here):

Task Details

The language pairs ENG-SPA and MSA-EGY are independent tasks. Although we highly encourage submissions on both pairs, participants may choose one or both.

When the test phase starts, we will open a competition on CodaLab. The shared task will be divided into two competitions, one for ENG-SPA and the other for MSA-EGY. We will add more information closer to the test phase date.

NOTE: Participants can use any resources (e.g., pre-trained word embeddings, gazetteers, etc.) that they consider appropriate for the task. For ranking purposes, systems with and without external resources will not be distinguished. However, we highly encourage participants to keep track of the performance impact of added resources and to include such insights in their papers.


We will evaluate your output predictions with the F1 metric (the harmonic mean of precision and recall), the standard way to evaluate an NER task. Additionally, we include the surface-form F1 metric introduced at the Workshop on Noisy User-generated Text, W-NUT 2017 (Derczynski et al., 2017). You can download the evaluation package here (instructions included).
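As a rough illustration of the metric, the sketch below computes exact-match, entity-level F1 over sets of predicted and gold spans. It is not the official scoring script (use the evaluation package for that); span representation and matching criteria here are simplifying assumptions.

```python
# Minimal sketch of exact-match entity-level F1: the harmonic mean of
# precision and recall over predicted entity spans. Illustrative only;
# the official evaluation package should be used for the shared task.

def entity_f1(gold, predicted):
    """gold, predicted: collections of (start, end, entity_type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)              # exact span + type matches
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two gold entities found, one spurious prediction.
gold = {(4, 5, "PER"), (6, 7, "LOC")}
pred = {(4, 5, "PER"), (0, 1, "ORG")}
print(entity_f1(gold, pred))  # precision 0.5, recall 0.5 -> F1 = 0.5
```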

Program Committee

Elabbas Benmamoun    Duke University
Agnes Bolonyia    NC State University
Monojit Choudhury    Microsoft Research India
Barbara Bullock    University of Texas at Austin
Suzanne Dikker    New York University
Raymond Mooney    University of Texas at Austin
Chilin Shih    University of Illinois at Urbana-Champaign
Jacqueline Toribio    University of Texas at Austin
Rabih Zbib    BBN Technologies
Constantine Lignos    University of Southern California Information Sciences Institute
Cecilia Montes-Alcala    Georgia Institute of Technology
Mitchell P. Marcus    University of Pennsylvania
Yves Scherrer    University of Helsinki
Björn Gambäck    Norwegian University of Science and Technology
Borja Navarro Colorado    Universidad de Alicante
Younes Samih    Heinrich Heine - Universität Düsseldorf
David Vilares    Universidade da Coruña
Ozlem Cetinoglu    Universität Stuttgart
Emre Yilmaz    CLS/CLST, Radboud University Nijmegen
Kalika Bali    Microsoft Research India
David Suendermann    Educational Testing Service
Alan W Black    Carnegie Mellon University


Gustavo Aguilar (contact person)
Ph.D. Student
Department of Computer Science
University of Houston
Fahad AlGhamdi (contact person)
Ph.D. Student
Department of Computer Science
George Washington University
Victor Soto (contact person)
Ph.D. Student
Department of Computer Science
Columbia University
Thamar Solorio
Department of Computer Science
University of Houston
Mona Diab
Department of Computer Science
George Washington University
Julia Hirschberg
Department of Computer Science
Columbia University