Melbourne, Australia

Third Workshop on Computational Approaches to Linguistic Code-switching

The 3rd Workshop will be co-located with ACL 2018 on July 19 in Melbourne, Australia.

Workshop Dates


  • Apr 30: Long and short paper submission deadline
  • May 14: Acceptance notification
  • May 21: Camera ready
  • Jul 19: Workshop

Shared Task Dates


  • Feb 8: Training data release
  • Mar 23: Test phase starts
  • Apr 19: Test phase ends (deadline extended from Apr 6)
  • May 4: System description paper
  • May 14: Author feedback
  • May 21: Camera ready
*All deadlines are at 23:59 GMT -08:00

Introduction


Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential, intrasentential (mixing of words from multiple languages in the same utterance), and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies such as parsing, Machine Translation (MT), Automatic Speech Recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input from another language is mixed in. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed-language content present.

This workshop aims to bring together researchers interested in solving the problem and to increase community awareness of viable solutions that reduce the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop include the following:

  • Development of linguistic resources to support research on code-switched data
  • NLP approaches for language identification in code-switched data
  • NLP approaches for named entity recognition in code-switched data
  • NLP techniques for the syntactic analysis of code-switched data
  • Domain/dialect/genre adaptation techniques applied to code-switched data processing
  • Language modeling approaches to code-switched data processing
  • Crowdsourcing approaches for the annotation of code-switched data
  • Machine translation approaches for code-switched data
  • Position papers discussing the challenges of code-switched data to NLP techniques
  • Methods for improving ASR in code-switched data
  • Survey papers of NLP research for code-switched data
  • Sociolinguistic aspects of code-switching
  • Sociopragmatic aspects of code-switching

Invited Speakers


Notice: More speakers to be added.
Pascale Fung    Hong Kong University of Science & Technology

Shared Task: Named Entity Recognition on Code-switched Data


On this occasion, we are organizing a Named Entity Recognition (NER) shared task on CS data with the purpose of providing even more resources to the community. The goal is to allow participants to explore the use of supervised, semi-supervised, and/or unsupervised approaches to predict the entity types in CS data. We will release the training and development data for tuning and testing systems in the following language pairs: English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY). We will use Twitter data for both language pairs.

Participants in this shared task will be required to submit the output of their systems within a prespecified time window in order to qualify for evaluation. They will also be required to submit a paper describing their system.

Entity Types

  • Person 
  • Location 
  • Organization 
  • Group 
  • Title 
  • Product 
  • Event 
  • Time 
  • Other 

Fill out the registration form to participate in the shared task.

Updates will be given through the workshop Google group: codeswitching_workshop@googlegroups.com, and the Twitter account: @WCALCS. Direct updates will be sent by email to the participants based on the information provided in the registration form.


Data Release

For both MSA-EGY and ENG-SPA tweets, we provide packages that retrieve, tokenize, and synchronize the NE types of the training and development data: MSA-EGY Package and ENG-SPA Package. Instructions on how to use the packages are included. Additionally, the data has been tagged using the IOB scheme with the entity types listed above.
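
As an illustration of the IOB scheme, the following minimal Python sketch groups token-level tags into entity spans. The tag spellings (`B-PER`, `I-PER`, `B-LOC`, `O`), the function name `iob_to_spans`, and the example tweet are assumptions made for illustration only; the labels actually used in the released data are documented in the packages and guidelines.

```python
from typing import List, Tuple

def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end_exclusive, entity_type) spans.

    Assumes tags look like "B-PER", "I-PER", or "O" (hypothetical labels).
    A stray "I-" tag with no matching open span is leniently treated as
    the beginning of a new entity.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, label = "O", None
        else:
            prefix, _, label = tag.partition("-")
        # An open span continues only on an I- tag of the same type.
        continues = prefix == "I" and etype is not None and label == etype
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

# Invented example tweet, not taken from the shared task data.
tokens = ["Vi", "a", "Barack", "Obama", "en", "Houston", "yesterday"]
tags = ["O", "O", "B-PER", "I-PER", "O", "B-LOC", "O"]
for start, end, etype in iob_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Each span covers the `B-` token and any immediately following `I-` tokens of the same type, so a multi-token name such as "Barack Obama" is recovered as a single entity.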

For English-Spanish data (guidelines here):

  • Training set (50,757 tweets): train_offset.tsv
  • Development set (832 tweets): dev_offset.tsv
  • Testing set: the test set has been sent to the registered participants.

For Modern Standard Arabic-Egyptian data (guidelines here):


Task Details

The language pairs ENG-SPA and MSA-EGY are independent tasks. Although we highly encourage submissions on both pairs, participants can choose to take part in one or both.

We have opened the competitions in CodaLab already. Please follow the links below and request access to the competitions. More instructions for the competitions are provided in the links:

We also provide the baseline scores on each competition. Here's the description: NER baseline.

NOTE: Participants can use any resources (e.g., pre-trained word embeddings, gazetteers, etc.) that they consider appropriate for the task. For the purposes of the competition, no distinction is made between systems that use external resources and systems that do not. However, we highly encourage participants to keep track of the performance when adding resources so that they can include such insights in their papers.


Evaluation

We will evaluate your output predictions with the F1 metric (the harmonic mean of precision and recall), which is the standard way to evaluate an NER task. Additionally, we include the Surface Forms F1 metric introduced in the Workshop on Noisy User-generated Text, W-NUT 2017 (Derczynski et al., 2017). You can download the evaluation package here (instructions included).
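
To make the two metrics concrete, here is a minimal Python sketch, not the official evaluation package. It computes F1 = 2PR / (P + R) over sets of items. For the entity-level score, an item is assumed to be a (tweet id, start, end, type) tuple; for the surface-form score, an item is assumed to be a unique (surface text, type) pair, which is one reading of the W-NUT 2017 definition. The function names and the toy gold/predicted entities are hypothetical, and the official package remains authoritative.

```python
from typing import Iterable, List, Tuple

# (tweet_id, start_token, end_token_exclusive, entity_type, surface_text)
Entity = Tuple[int, int, int, str, str]

def f1_score(gold: Iterable, pred: Iterable) -> float:
    """Harmonic mean of precision and recall over two collections of items."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)

def entity_f1(gold: List[Entity], pred: List[Entity]) -> float:
    """Score every entity occurrence by its position and type."""
    return f1_score((e[:4] for e in gold), (e[:4] for e in pred))

def surface_f1(gold: List[Entity], pred: List[Entity]) -> float:
    """Score unique (surface text, type) pairs, ignoring position and frequency."""
    return f1_score(((e[4], e[3]) for e in gold), ((e[4], e[3]) for e in pred))

# Toy data invented for illustration.
gold = [
    (0, 2, 4, "PER", "Barack Obama"),
    (0, 5, 6, "LOC", "Houston"),
    (1, 0, 1, "LOC", "Houston"),
]
pred = [
    (0, 2, 4, "PER", "Barack Obama"),
    (1, 0, 1, "LOC", "Houston"),
    (1, 3, 4, "ORG", "Houston"),
]
print(f"entity F1:  {entity_f1(gold, pred):.4f}")
print(f"surface F1: {surface_f1(gold, pred):.4f}")
```

In this toy case the system misses one of the two "Houston" mentions, which lowers the entity-level recall, while the surface-form recall is unaffected because the pair ("Houston", LOC) was recognized at least once.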


Results

Results of the ENG-SPA competition:

Team               F1 Score
IIT BHU            63.7628
CAiRE++            62.7608
FAIR               62.6671
Linguists          62.1307
Flytxt             59.2501
semantic           56.7205
BATs               54.1612
Fraunhofer FKIE    53.6514
Baseline           53.2802

Results of the MSA-EGY competition:

Team               F1 Score
FAIR               71.6154
GHHT               70.0938
Linguists          67.4419
CAiRE++            66.0410
BATs               65.6207
semantic           65.0276
Baseline           62.7084

BibTex

Please cite the shared task paper with the following BibTex:

@inproceedings{calcs2018shtask,
    title={{Overview of the CALCS 2018 Shared Task: Named Entity Recognition on Code-switched Data}},
    author={Aguilar, Gustavo and AlGhamdi, Fahad and Soto, Victor and Diab, Mona and Hirschberg, Julia and Solorio, Thamar},
    publisher = {Association for Computational Linguistics},
    month={July},
    year={2018},
    address={Melbourne, Australia},
    booktitle = {Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching}
}

Paper Submissions


Authors are invited to submit papers describing original, unpublished work in the topic areas listed above. Full papers should not exceed eight pages. Additionally, authors are invited to submit short papers not exceeding 4 pages. Short papers usually describe:

  • a small, focused contribution;
  • work in progress;
  • a negative result;
  • an opinion piece; or
  • an interesting application nugget.

All papers can have up to 2 pages of references. All submissions must be in PDF format and must conform to the official ACL 2018 style guidelines:

The reviewing process will be blind, and papers should not include the authors' names and affiliations. Each submission will be reviewed by at least three members of the program committee. Accepted papers will be published in the workshop proceedings and made available in the ACL Anthology.

Multiple Submission Policy. Papers that have been or will be submitted to other meetings or publications are acceptable, but authors must indicate this information at submission time. If accepted, authors must notify the organizers as to whether the paper will be presented at the workshop or elsewhere.

Papers should be submitted electronically at https://www.softconf.com/acl2018/CALCS.

Program Committee


Elabbas Benmamoun    Duke University
Agnes Bolonyai    NC State University
Monojit Choudhury    Microsoft Research India
Barbara Bullock    University of Texas at Austin
Suzanne Dikker    New York University
Raymond Mooney    University of Texas at Austin
Chilin Shih    University of Illinois at Urbana-Champaign
Jacqueline Toribio    University of Texas at Austin
Constantine Lignos    University of Southern California Information Sciences Institute
Cecilia Montes-Alcala    Georgia Institute of Technology
Mitchell P. Marcus    University of Pennsylvania
Yves Scherrer    University of Helsinki
Björn Gambäck    Norwegian University of Science and Technology
Borja Navarro Colorado    Universidad de Alicante
Younes Samih    Heinrich-Heine-Universität Düsseldorf
David Vilares    Universidad de Coruña
Ozlem Cetinoglu    Universität Stuttgart
Emre Yilmaz    CLS/CLST, Radboud University Nijmegen
Kalika Bali    Microsoft Research India
David Suendermann    Educational Testing Service
Alan W Black    Carnegie Mellon University

Organizers


Gustavo Aguilar (contact person)
Ph.D. Student
Department of Computer Science
University of Houston

Fahad AlGhamdi (contact person)
Ph.D. Student
Department of Computer Science
George Washington University

Victor Soto (contact person)
Ph.D. Student
Department of Computer Science
Columbia University

Thamar Solorio
Professor
Department of Computer Science
University of Houston

Mona Diab
Professor
Department of Computer Science
George Washington University

Julia Hirschberg
Professor
Department of Computer Science
Columbia University