CS Workshop

We have centralized many code-switching datasets, including the data from the CALCS series, into a single code-switching benchmark. Please consider using the improved version of the data and the public leaderboards available here: ritual.uh.edu/lince

Workshop Dates

Apr 30: Long and short paper submission deadline
May 14: Acceptance notification
May 21: Camera ready
Jul 19: Workshop

Shared Task Dates

Feb 8: Training data release
Mar 23: Test phase starts
~~Apr 6~~ Apr 19: Test phase ends (deadline extended)
May 4: System description paper
May 14: Author feedback
May 21: Camera ready

*All deadlines are at 23:59 GMT -08:00

Introduction

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential, intrasentential (mixing of words from multiple languages in the same utterance) and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies such as parsing, machine translation (MT), automatic speech recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input from another language is mixed in. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed-language content present.

This workshop aims to bring together researchers interested in solving the problem and to increase community awareness of viable solutions that reduce the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop include the following:

Development of linguistic resources to support research on code-switched data
NLP approaches for language identification in code-switched data
NLP approaches for named entity recognition in code-switched data
NLP techniques for the syntactic analysis of code-switched data
Domain/dialect/genre adaptation techniques applied to code-switched data processing
Language modeling approaches to code-switched data processing
Crowdsourcing approaches for the annotation of code-switched data
Machine translation approaches for code-switched data
Position papers discussing the challenges of code-switched data to NLP techniques
Methods for improving ASR in code-switched data
Survey papers of NLP research for code-switched data
Sociolinguistic aspects of code-switching
Sociopragmatic aspects of code-switching

Invited Speakers

Pascale Fung Hong Kong University of Science & Technology

Learning to Code-Switch

Abstract

With the explosive growth of online and social media activities and the dominance of English as the internet language, code-switching (CW) between a native matrix language and an embedded language (often English) has become a relevant phenomenon. However, the lack of code-switching training data has long been the bottleneck to successful recognition and understanding of code-switching speech. Meanwhile, a large amount of parallel bilingual data is available for training machine translation systems. In this talk, I will give an overview ranging from a traditional statistical approach to CW language modelling that incorporates linguistic structures to a neural network end-to-end approach that attempts to learn these structures. I will present how we can leverage parallel bilingual data and a small amount of code-switching data for language modelling. We believe that learning how to code-switch is a more promising approach as more code-switching data becomes available in the future.

Short Bio

Pascale Fung is a Professor at the Department of Electronic & Computer Engineering and the Department of Computer Science & Engineering at The Hong Kong University of Science & Technology (HKUST). She is an elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE) for her “contributions to human-machine interactions”, and an elected Fellow of the International Speech Communication Association for “fundamental contributions to the interdisciplinary area of spoken language human-machine interactions”. She is the Director of the HKUST Center for AI Research (CAiRE), an interdisciplinary research center spanning all four schools at HKUST. She co-founded the Human Language Technology Center (HLTC). She is an affiliated faculty member of the Robotics Institute and the Big Data Institute at HKUST. She is the founding chair of the Women Faculty Association at HKUST. She is an expert on the Global Future Council on AI and Robotics, a think tank for the World Economic Forum, and blogs for the online WEF publication Agenda. She represents HKUST on the Partnership on AI to Benefit People and Society.

Melinda Fricke University of Pittsburgh

Variation in codeswitched language: a psycholinguistic approach to what, when, and why

Abstract

Linguistic variation is pervasive at all levels of representation. From differences in syntactic constituency down to subtle, sub-phonemic modulations in articulation, languages as well as individual speakers can vary widely in terms of how they express a given meaning. Codeswitching between languages adds an additional layer to an already highly complex situation, in that the combinatorial rules from multiple grammars are in play and can influence the speech (or text) that is ultimately produced.

Importantly, however, codeswitching behavior is not random. Rather, like all other forms of language, codeswitching displays certain rules and tendencies that are predictable as a function of the languages being spoken and the speakers producing them. In this talk, I lay out the psycholinguistic principles thought to influence the patterns that can be observed in codeswitched language, drawing examples from sociolinguistically- and psycholinguistically-oriented corpus studies, as well as controlled experimental paradigms. The story that emerges is one of wide but principled variation: I argue that, while much remains to be done before we fully understand why speakers produce the variants that they do, the problem is a tractable one, and will continue to see progress as long as careful experimentation and psycholinguistically informed corpus investigation are brought to bear.

Short Bio

Melinda Fricke is an Assistant Professor in the Department of Linguistics at the University of Pittsburgh. She received her Ph.D. in Linguistics from UC Berkeley in 2013, conducted her postdoctoral work at the Center for Language Science in the Psychology Department at Penn State University from 2013 to 2016, and began her current position in fall 2016. Her work combines corpus and experimental methods to study the psycholinguistics of speech production and speech perception, often focusing on phonetic variation in bilingual speakers.

Shared Task: Named Entity Recognition on Code-switched Data

On this occasion, we are organizing a Named Entity Recognition (NER) shared task on CS data with the purpose of providing even more resources to the community. The goal is to allow participants to explore the use of supervised, semi-supervised and/or unsupervised approaches to predict the entity types in CS data. We will release the training and development data for tuning and testing systems in the following language pairs: Spanish-English (SPA-ENG) and Modern Standard Arabic-Egyptian (MSA-EGY). We will use Twitter data for both language pairs.

Participants in this shared task will be required to submit the output of their systems within a prespecified time window in order to qualify for evaluation. They will also be required to submit a paper describing their system.

Entity Types

Person
Location
Organization
Group
Title
Product
Event
Time
Other

Fill out the registration form to participate in the shared task.

Updates will be given through the workshop Google group: codeswitching_workshop@googlegroups.com, and the Twitter account: @WCALCS. Direct updates will be sent by email to the participants based on the information provided in the registration form.

Data Release

For both MSA-EGY and ENG-SPA tweets, we provide packages that retrieve, tokenize, and synchronize the NE types of the training and development data: MSA-EGY Package and ENG-SPA Package. Instructions on how to use the packages are included. Additionally, the data has been tagged using the IOB scheme with the entity types listed above.
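To illustrate what IOB tagging means in practice, here is a minimal Python sketch that groups a tweet's token-level tags into entity spans. It is not part of the released packages. The label abbreviations used here (such as PER and LOC) and the helper name iob_to_spans are assumptions made for illustration; consult the annotation guidelines for the exact labels in the data.

```python
from typing import List, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Group a sequence of IOB tags into entity spans.

    "B-X" begins an entity of type X, "I-X" continues it, and "O" marks
    a token outside any entity. An "I-X" tag that does not continue an
    entity of the same type is treated leniently as the start of a new one.
    """
    spans: List[Span] = []
    start, current = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current
        if continues:
            continue
        if current is not None:
            spans.append((current, start, i))
        start, current = None, None
        if tag != "O":
            start, current = i, tag[2:]
    return spans


# Hypothetical tag sequence for a six-token tweet.
example_tags = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_spans(example_tags))
```

Working with spans rather than individual tags matters because NER systems are normally scored on whole entities: a prediction that gets only part of a multi-token name right does not count as a match.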

For English-Spanish data (guidelines here):

Training set (50,757 tweets): train_offset.tsv
Development set (832 tweets): dev_offset.tsv
Testing set: the test set has been sent to the registered participants.

For Modern Standard Arabic-Egyptian data (guidelines here):

Training set (10,102 tweets): Train-MSA-EGY-2018.tsv
Development set (1,122 tweets): Dev-MSA-EGY-2018.tsv
Testing set: the test set has been sent to the registered participants.

Task Details

The language pairs ENG-SPA and MSA-EGY are independent tasks. Although we highly encourage submissions on both pairs, participants can choose one or both language pairs.

We have opened the competitions in CodaLab already. Please follow the links below and request access to the competitions. More instructions for the competitions are provided in the links:

We also provide the baseline scores on each competition. Here's the description: NER baseline.

NOTE: Participants can use any resources (e.g., pre-trained word embeddings, gazetteers, etc.) that they consider appropriate for the task. For the purposes of the competition, no distinction is made between systems that use such resources and systems that do not. However, we highly encourage participants to keep track of the performance when adding resources and to include such insights in their papers.

Evaluation

We are going to evaluate your output predictions with the F1 metric, the harmonic mean of precision and recall. This is the standard way to evaluate an NER task. Additionally, we include the Surface Forms F1 metric introduced in the Workshop on Noisy User-generated Text, W-NUT 2017 (Derczynski et al., 2017). You can download the evaluation package here (instructions included).
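As a rough illustration of the two metrics, the following Python sketch computes F1 over sets of items. It is not the official evaluation package, whose details may differ. The function name f1_score and the example entities are hypothetical. For the entity-level score, each item is assumed to be one entity occurrence (tweet id, type, start, end); for the surface-form score, each item is assumed to be a unique (surface text, type) pair, so repeated mentions of the same name count once.

```python
from typing import Hashable, Set


def f1_score(gold: Set[Hashable], predicted: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over exactly matching items."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Entity-level: (tweet id, type, start token, end token exclusive).
gold_entities = {(0, "PER", 1, 3), (0, "LOC", 4, 5), (1, "ORG", 0, 1)}
pred_entities = {(0, "PER", 1, 3), (1, "ORG", 0, 2)}
entity_f1 = f1_score(gold_entities, pred_entities)

# Surface forms: unique (lowercased text, type) pairs across the corpus.
gold_surface = {("pascale fung", "PER"), ("houston", "LOC")}
pred_surface = {("pascale fung", "PER"), ("houston", "ORG")}
surface_f1 = f1_score(gold_surface, pred_surface)

print(f"entity-level F1: {entity_f1:.4f}")
print(f"surface-form F1: {surface_f1:.4f}")
```

In the entity-level example, one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), giving an F1 of 0.4. The surface-form variant rewards systems that recognize many distinct names rather than the same frequent name many times.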

Results

Results of the ENG-SPA competition are listed first, followed by the results of the MSA-EGY competition:

Team	F1 Score
IIT BHU	63.7628
CAiRE++	62.7608
FAIR	62.6671
Linguists	62.1307
Flytxt	59.2501
semantic	56.7205
BATs	54.1612
Fraunhofer FKIE	53.6514
Baseline	53.2802

Team	F1 Score
FAIR	71.6154
GHHT	70.0938
Linguists	67.4419
CAiRE++	66.0410
BATs	65.6207
semantic	65.0276
Baseline	62.7084

BibTeX

Please cite the shared task paper with the following BibTeX:

@inproceedings{calcs2018shtask,
    title={{Overview of the CALCS 2018 Shared Task: Named Entity Recognition on Code-switched Data}},
    author={Aguilar, Gustavo and AlGhamdi, Fahad and Soto, Victor and Diab, Mona and Hirschberg, Julia and Solorio, Thamar},
    publisher = {Association for Computational Linguistics},
    month={July},
    year={2018},
    address={Melbourne, Australia},
    booktitle = {Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching}
}

Paper Submissions

Authors are invited to submit papers describing original, unpublished work in the topic areas listed above. Full papers should not exceed eight pages. Additionally, authors are invited to submit short papers not exceeding four pages. Short papers usually describe:

a small, focused contribution;
work in progress;
a negative result;
an opinion piece; or
an interesting application nugget.

All papers can have up to two pages of references. All submissions must be in PDF format and must conform to the official ACL 2018 style guidelines:

ACL Author Guidelines (see Paper Submission and Templates for templates).

The reviewing process will be blind and papers should not include the authors' names and affiliations. Each submission will be reviewed by at least three members of the program committee. Accepted papers will be published in the workshop proceedings and available at the ACL Anthology.

Multiple Submission Policy. Papers that have been or will be submitted to other meetings or publications are acceptable, but authors must indicate this information at submission time. If accepted, authors must notify the organizers as to whether the paper will be presented at the workshop or elsewhere.

Papers should be submitted electronically at https://www.softconf.com/acl2018/CALCS.

Program Committee

Elabbas Benmamoun Duke University

Agnes Bolonyai NC State University

Monojit Choudhury Microsoft Research India

Barbara Bullock University of Texas at Austin

Suzanne Dikker New York University

Raymond Mooney University of Texas at Austin

Chilin Shih University of Illinois at Urbana-Champaign

Jacqueline Toribio University of Texas at Austin

Constantine Lignos University of Southern California Information Sciences Institute

Cecilia Montes-Alcala Georgia Institute of Technology

Mitchell P. Marcus University of Pennsylvania

Yves Scherrer University of Helsinki

Björn Gambäck Norwegian University of Science and Technology

Borja Navarro Colorado Universidad de Alicante

Younes Samih Heinrich-Heine-Universität Düsseldorf

David Vilares Universidad de Coruña

Ozlem Cetinoglu Universität Stuttgart

Emre Yilmaz CLS/CLST, Radboud University Nijmegen

Kalika Bali Microsoft Research India

David Suendermann Educational Testing Service

Alan W Black Carnegie Mellon University