Melbourne, Australia

Third Workshop on Computational Approaches to Linguistic Code-switching

The third workshop will be co-located with ACL 2018 on July 19 in Melbourne, Australia.

We have centralized many code-switching datasets, including the data from the CALCS series, into a single code-switching benchmark. Please consider using the improved version of the data and the public leaderboards available here: ritual.uh.edu/lince

Workshop Dates


  • Apr 30: Long and short paper submission deadline
  • May 14: Acceptance notification
  • May 21: Camera ready
  • Jul 19: Workshop

Shared Task Dates


  • Feb 8: Training data release
  • Mar 23: Test phase starts
  • Apr 19: Test phase ends (deadline extended from Apr 6)
  • May 4: System description paper
  • May 14: Author feedback
  • May 21: Camera ready
*All deadlines are at 23:59 GMT -08:00

Introduction


Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS is typically present at the intersentential, intrasentential (mixing of words from multiple languages in the same utterance) and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies such as parsing, Machine Translation (MT), Automatic Speech Recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input from another language is mixed in. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed-language content present.

This workshop aims to bring together researchers interested in solving the problem and to increase community awareness of viable solutions that reduce the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop include the following:

  • Development of linguistic resources to support research on code-switched data
  • NLP approaches for language identification in code-switched data
  • NLP approaches for named entity recognition in code-switched data
  • NLP techniques for the syntactic analysis of code-switched data
  • Domain/dialect/genre adaptation techniques applied to code-switched data processing
  • Language modeling approaches to code-switched data processing
  • Crowdsourcing approaches for the annotation of code-switched data
  • Machine translation approaches for code-switched data
  • Position papers discussing the challenges of code-switched data to NLP techniques
  • Methods for improving ASR in code-switched data
  • Survey papers of NLP research for code-switched data
  • Sociolinguistic aspects of code-switching
  • Sociopragmatic aspects of code-switching

Invited Speakers


Pascale Fung    Hong Kong University of Science & Technology
Learning to Code-Switch
Abstract

With the explosive growth of online and social media activities and the dominance of English as the internet language, code-switching (CW) between a native matrix language and an embedded language (often English) has become a relevant phenomenon. However, the lack of code-switching training data has long been the bottleneck to successful recognition and understanding of code-switching speech. Meanwhile, there is a large amount of parallel bilingual data for training machine translation systems. In this talk, I will give an overview of both a traditional statistical approach to CW language modelling that incorporates linguistic structures and a neural network end-to-end approach that attempts to learn these structures. I will present how we can leverage parallel bilingual data and a small amount of code-switching data for language modelling. We believe that learning how to code-switch is a more promising approach as more code-switching data becomes available in the future.

Short Bio

Pascale Fung is a Professor at the Department of Electronic & Computer Engineering and Department of Computer Science & Engineering at The Hong Kong University of Science & Technology (HKUST). She is an elected Fellow of the Institute of Electrical and Electronic Engineers (IEEE) for her “contributions to human-machine interactions”, and an elected Fellow of the International Speech Communication Association for “fundamental contributions to the interdisciplinary area of spoken language human-machine interactions”. She is the Director of HKUST Center for AI Research (CAiRE), an interdisciplinary research center on top of all four schools at HKUST. She co-founded the Human Language Technology Center (HLTC). She is an affiliated faculty member of the Robotics Institute and the Big Data Institute at HKUST. She is the founding chair of the Women Faculty Association at HKUST. She is an expert on the Global Future Council on AI and Robotics, a think tank for the World Economic Forum, and blogs for the online WEF publication Agenda. She represents HKUST on Partnership on AI to Benefit People and Society.

Melinda Fricke    University of Pittsburgh
Variation in codeswitched language: a psycholinguistic approach to what, when, and why
Abstract

Linguistic variation is pervasive at all levels of representation. From differences in syntactic constituency down to subtle, sub-phonemic modulations in articulation, languages as well as individual speakers can vary widely in terms of how they express a given meaning. Codeswitching between languages adds an additional layer to an already highly complex situation, in that the combinatorial rules from multiple grammars are in play and can influence the speech (or text) that is ultimately produced.

Importantly, however, codeswitching behavior is not random. Rather, like all other forms of language, codeswitching displays certain rules and tendencies that are predictable as a function of the languages being spoken and the speakers producing them. In this talk, I lay out the psycholinguistic principles thought to influence the patterns that can be observed in codeswitched language, drawing examples from sociolinguistically- and psycholinguistically-oriented corpus studies, as well as controlled experimental paradigms. The story that emerges is one of wide but principled variation: I argue that, while much remains to be done before we fully understand why speakers produce the variants that they do, the problem is a tractable one, and will continue to see progress as long as careful experimentation and psycholinguistically informed corpus investigation are brought to bear.

Short Bio

Melinda Fricke is an Assistant Professor in the Department of Linguistics at the University of Pittsburgh. She received her Ph.D. in Linguistics from UC Berkeley in 2013, conducted her postdoctoral work at the Center for Language Science in the Psychology Department at Penn State University from 2013 to 2016, and began her current position in fall 2016. Her work combines corpus and experimental methods to study the psycholinguistics of speech production and speech perception, often focusing on phonetic variation in bilingual speakers.

Shared Task: Named Entity Recognition on Code-switched Data


On this occasion, we are organizing a Named Entity Recognition (NER) shared task on CS data with the purpose of providing even more resources to the community. The goal is to allow participants to explore the use of supervised, semi-supervised and/or unsupervised approaches to predict the entity types in CS data. We will release the training and development data for tuning and testing systems in the following language pairs: Spanish-English (SPA-ENG) and Modern Standard Arabic-Egyptian (MSA-EGY). We will use Twitter data for both language pairs.

Participants in this shared task will be required to submit the output of their systems within a prespecified time window in order to qualify for evaluation. They will also be required to submit a paper describing their system.

Entity Types

  • Person 
  • Location 
  • Organization 
  • Group 
  • Title 
  • Product 
  • Event 
  • Time 
  • Other 

Fill out the registration form to participate in the shared task.

Updates will be given through the workshop Google group: codeswitching_workshop@googlegroups.com, and the Twitter account: @WCALCS. Direct updates will be sent by email to the participants based on the information provided in the registration form.


Data Release

For both MSA-EGY and ENG-SPA tweets, we provide packages that retrieve, tokenize, and synchronize the NE types of the training and development data: MSA-EGY Package and ENG-SPA Package. Instructions on how to use the packages are included. Additionally, the data has been tagged using the IOB scheme with the entity types listed above.
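As an illustration of how IOB-tagged tokens map to entities, the sketch below groups a tag sequence into typed spans. It is a minimal example, not the organizers' tooling: the label strings (B-PER, I-LOC, and so on), the sample tweet, and the function name iob_to_spans are assumptions made for illustration, and the labels in the released data may be spelled differently.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the type matches.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        # Open a span on B-, or on a stray I- when no span is open.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


# Hypothetical code-switched tweet; labels are assumed, not taken from the data.
tokens = ["Maria", "vive", "en", "New", "York", "now"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]

for s, e, t in iob_to_spans(tags):
    print(t, " ".join(tokens[s:e]))
```

A stray I- tag with no preceding B- is treated here as the start of a new entity; other tools may handle that case differently.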

For English-Spanish data (guidelines here):

  • Training set (50,757 tweets): train_offset.tsv
  • Development set (832 tweets): dev_offset.tsv
  • Testing set: the test set has been sent to the registered participants.

For Modern Standard Arabic-Egyptian data (guidelines here):


Task Details

The language pairs ENG-SPA and MSA-EGY are independent tasks. Although we highly encourage submissions on both pairs, participants may take part in either one or both.

We have opened the competitions in CodaLab already. Please follow the links below and request access to the competitions. More instructions for the competitions are provided in the links:

We also provide the baseline scores on each competition. Here's the description: NER baseline.

NOTE: Participants can use any resources (e.g., pre-trained word embeddings, gazetteers, etc.) that they consider appropriate for the task. For the purposes of the competition, systems that use external resources and systems that do not are ranked together. However, we highly encourage participants to keep track of the performance when adding resources and to include such insights in their papers.


Evaluation

We are going to evaluate your output predictions with the F-1 metric (the harmonic mean of precision and recall). This is the standard way to evaluate an NER task. Additionally, we include the Surface Forms F-1 metric introduced in the Workshop on Noisy User-generated Text, W-NUT 2017 (Derczynski et al., 2017). You can download the evaluation package here (instructions included).
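The two metrics can be sketched as follows. This is not the official evaluation package: it assumes exact-match scoring over entity tuples, and it assumes the surface-form variant scores the set of unique (surface form, type) pairs, which is our reading of the W-NUT 2017 description. The tuple layout, sample entities, and function names are hypothetical.

```python
from typing import Set, Tuple


def f1_score(gold: Set[Tuple], pred: Set[Tuple]) -> float:
    """Harmonic mean of precision and recall over exact-match items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def surface_forms(entities: Set[Tuple]) -> Set[Tuple[str, str]]:
    """Reduce entity mentions to unique (surface form, type) pairs."""
    return {(e[4], e[3]) for e in entities}


# Assumed layout: (tweet_id, start, end_exclusive, type, surface_form)
gold = {
    (1, 0, 1, "PER", "Maria"),
    (1, 3, 5, "LOC", "New York"),
    (2, 0, 2, "LOC", "New York"),
}
pred = {
    (1, 0, 1, "PER", "Maria"),
    (1, 3, 5, "LOC", "New York"),
    (2, 0, 1, "LOC", "New"),  # boundary error: counts as a miss
}

entity_f1 = f1_score(gold, pred)
surface_f1 = f1_score(surface_forms(gold), surface_forms(pred))
print(f"entity F1 = {entity_f1:.4f}, surface-form F1 = {surface_f1:.4f}")
```

In this toy case the repeated mention "New York" is counted once under the surface-form metric, so a system is not rewarded or penalized repeatedly for the same frequent entity.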


Results

Results of the ENG-SPA competition:

Team               F1 Score
IIT BHU            63.7628
CAiRE++            62.7608
FAIR               62.6671
Linguists          62.1307
Flytxt             59.2501
semantic           56.7205
BATs               54.1612
Fraunhofer FKIE    53.6514
Baseline           53.2802

Results of the MSA-EGY competition:

Team               F1 Score
FAIR               71.6154
GHHT               70.0938
Linguists          67.4419
CAiRE++            66.0410
BATs               65.6207
semantic           65.0276
Baseline           62.7084

BibTeX

Please cite the shared task paper with the following BibTeX:

@inproceedings{calcs2018shtask,
    title={{Overview of the CALCS 2018 Shared Task: Named Entity Recognition on Code-switched Data}},
    author={Aguilar, Gustavo and AlGhamdi, Fahad and Soto, Victor and Diab, Mona and Hirschberg, Julia and Solorio, Thamar},
    publisher = {Association for Computational Linguistics},
    month={July},
    year={2018},
    address={Melbourne, Australia},
    booktitle = {Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching}
}

Paper Submissions


Authors are invited to submit papers describing original, unpublished work in the topic areas listed above. Full papers should not exceed 8 pages. Additionally, authors are invited to submit short papers not exceeding 4 pages. Short papers usually describe:

  • a small, focused contribution;
  • work in progress;
  • a negative result;
  • an opinion piece; or
  • an interesting application nugget.

All papers can have up to 2 pages of references. All submissions must be in PDF format and must conform to the official ACL 2018 style guidelines:

The reviewing process will be blind and papers should not include the authors' names and affiliations. Each submission will be reviewed by at least three members of the program committee. Accepted papers will be published in the workshop proceedings and available at the ACL Anthology.

Multiple Submission Policy. Papers that have been or will be submitted to other meetings or publications are acceptable, but authors must indicate this information at submission time. If accepted, authors must notify the organizers as to whether the paper will be presented at the workshop or elsewhere.

Papers should be submitted electronically at https://www.softconf.com/acl2018/CALCS.

Program Committee


Elabbas Benmamoun    Duke University
Agnes Bolonyai    NC State University
Monojit Choudhury    Microsoft Research India
Barbara Bullock    University of Texas at Austin
Suzanne Dikker    New York University
Raymond Mooney    University of Texas at Austin
Chilin Shih    University of Illinois at Urbana-Champaign
Jacqueline Toribio    University of Texas at Austin
Constantine Lignos    University of Southern California Information Sciences Institute
Cecilia Montes-Alcala    Georgia Institute of Technology
Mitchell P. Marcus    University of Pennsylvania
Yves Scherrer    University of Helsinki
Björn Gambäck    Norwegian University of Science and Technology
Borja Navarro Colorado    Universidad de Alicante
Younes Samih    Heinrich-Heine-Universität Düsseldorf
David Vilares    Universidad de Coruña
Ozlem Cetinoglu    Universität Stuttgart
Emre Yilmaz    CLS/CLST, Radboud University Nijmegen
Kalika Bali    Microsoft Research India
David Suendermann    Educational Testing Service
Alan W Black    Carnegie Mellon University

Organizers


Gustavo Aguilar (contact person)
Ph.D. Student
Department of Computer Science
University of Houston
Fahad AlGhamdi (contact person)
Ph.D. Student
Department of Computer Science
George Washington University
Victor Soto (contact person)
Ph.D. Student
Department of Computer Science
Columbia University
Thamar Solorio
Professor
Department of Computer Science
University of Houston
Mona Diab
Professor
Department of Computer Science
George Washington University
Julia Hirschberg
Professor
Department of Computer Science
Columbia University