We have centralized many code-switching datasets, including the data from the CALCS series, into a single code-switching benchmark. Please consider using the improved version of the data and the public leaderboards available here: ritual.uh.edu/lince
Workshop Dates
- Apr 30: Long and short paper submission deadline
- May 14: Acceptance notification
- May 21: Camera ready
- Jul 19: Workshop
Shared Task Dates
- Feb 8: Training data release
- Mar 23: Test phase starts
- Apr 19: Test phase ends (deadline extended from Apr 6)
- May 4: System description paper
- May 14: Author feedback
- May 21: Camera ready
Introduction
Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential and intrasentential levels (mixing words from multiple languages in the same utterance), and even at the morphological level (mixing of morphemes). CS presents serious challenges for language technologies such as parsing, machine translation (MT), automatic speech recognition (ASR), information retrieval (IR), information extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when the input mixes in another. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed language present.
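To make the token-level view of intrasentential code-switching concrete, here is a minimal Python sketch (not part of the workshop materials; the word lists and labels are invented for illustration) that tags each word of a Spanish-English sentence with a coarse language label via lexicon lookup. Systems evaluated in the shared tasks use trained sequence labelers instead, and the ambiguous or shared words are exactly where such models are needed.

```python
# Toy illustration (not from the workshop or shared tasks): word-level
# language identification for an intrasentential Spanish-English sentence
# using a hypothetical lexicon lookup; real systems train sequence labelers.

EN_WORDS = {"i", "was", "at", "the", "but", "they", "were", "closed"}
ES_WORDS = {"fui", "a", "la", "tienda", "pero", "estaba", "cerrada"}

def label_token(token: str) -> str:
    """Assign a coarse language tag to a single token."""
    t = token.lower()
    if t in EN_WORDS and t not in ES_WORDS:
        return "en"
    if t in ES_WORDS and t not in EN_WORDS:
        return "es"
    return "ambiguous"  # shared/unknown tokens need sentence context to resolve

sentence = "I fui a la tienda but it was closed".split()
print([(tok, label_token(tok)) for tok in sentence])
```

Even this trivial example surfaces the core difficulty: "it" is left ambiguous, and many real words (e.g., "a") belong to both languages, which is why context-sensitive models trained on code-switched data are required.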
This workshop aims to bring together researchers interested in these problems and to raise community awareness of viable approaches for reducing the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop include the following:
- Development of linguistic resources to support research on code-switched data
- NLP approaches for language identification in code-switched data
- NLP approaches for named entity recognition in code-switched data
- NLP techniques for the syntactic analysis of code-switched data
- Domain/dialect/genre adaptation techniques applied to code-switched data processing
- Language modeling approaches to code-switched data processing
- Crowdsourcing approaches for the annotation of code-switched data
- Machine translation approaches for code-switched data
- Position papers discussing the challenges of code-switched data to NLP techniques
- Methods for improving ASR on code-switched data
- Survey papers of NLP research for code-switched data
- Sociolinguistic aspects of code-switching
- Sociopragmatic aspects of code-switching
Invited Speakers
With the explosive growth of online and social media activity and the dominance of English as the internet language, code-switching (CS) from a native matrix language to an embedded language (often English) has become a prominent phenomenon. However, the lack of code-switching training data has long been the bottleneck to successful recognition and understanding of code-switched speech. Meanwhile, there is abundant parallel bilingual data for machine translation system training. In this talk, I will give an overview ranging from a traditional statistical approach to CS language modelling that incorporates linguistic structures to an end-to-end neural network approach that attempts to learn these structures. I will present how we can leverage parallel bilingual data and a small amount of code-switched data for language modelling. We believe that learning how to code-switch is the more promising approach as more code-switching data becomes available in the future.
Pascale Fung is a Professor at the Department of Electronic & Computer Engineering and the Department of Computer Science & Engineering at The Hong Kong University of Science & Technology (HKUST). She is an elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE) for her “contributions to human-machine interactions”, and an elected Fellow of the International Speech Communication Association for “fundamental contributions to the interdisciplinary area of spoken language human-machine interactions”. She is the Director of the HKUST Center for AI Research (CAiRE), an interdisciplinary research center spanning all four schools at HKUST. She co-founded the Human Language Technology Center (HLTC). She is an affiliated faculty member of the Robotics Institute and the Big Data Institute at HKUST. She is the founding chair of the Women Faculty Association at HKUST. She is an expert on the Global Future Council on AI and Robotics, a think tank for the World Economic Forum, and blogs for the online WEF publication Agenda. She represents HKUST on the Partnership on AI to Benefit People and Society.
Linguistic variation is pervasive at all levels of representation. From differences in syntactic constituency down to subtle, sub-phonemic modulations in articulation, languages as well as individual speakers can vary widely in terms of how they express a given meaning. Codeswitching between languages adds an additional layer to an already highly complex situation, in that the combinatorial rules from multiple grammars are in play and can influence the speech (or text) that is ultimately produced.
Importantly, however, codeswitching behavior is not random. Rather, like all other forms of language, codeswitching displays certain rules and tendencies that are predictable as a function of the languages being spoken and the speakers producing them. In this talk, I lay out the psycholinguistic principles thought to influence the patterns that can be observed in codeswitched language, drawing examples from sociolinguistically- and psycholinguistically-oriented corpus studies, as well as controlled experimental paradigms. The story that emerges is one of wide but principled variation: I argue that, while much remains to be done before we fully understand why speakers produce the variants that they do, the problem is a tractable one, and will continue to see progress as long as careful experimentation and psycholinguistically informed corpus investigation are brought to bear.
Melinda Fricke is an Assistant Professor in the Department of Linguistics at the University of Pittsburgh. She received her Ph.D. in Linguistics from UC Berkeley in 2013, conducted her postdoctoral work at the Center for Language Science in the Psychology Department at Penn State University from 2013 to 2016, and began her current position in fall 2016. Her work combines corpus and experimental methods to study the psycholinguistics of speech production and speech perception, often focusing on phonetic variation in bilingual speakers.
Paper Submissions
Authors are invited to submit papers describing original, unpublished work in the topic areas listed above. Full papers should not exceed eight pages. Additionally, authors are invited to submit short papers not exceeding four pages. Short papers usually describe:
- a small, focused contribution;
- work in progress;
- a negative result;
- an opinion piece; or
- an interesting application nugget.
All papers may include up to two pages of references. All submissions must be in PDF format and must conform to the official ACL 2018 style guidelines:
- ACL Author Guidelines (see Paper Submission and Templates for templates).
The reviewing process will be blind, and papers should not include the authors' names and affiliations. Each submission will be reviewed by at least three members of the program committee. Accepted papers will be published in the workshop proceedings and made available in the ACL Anthology.
Multiple Submission Policy. Papers that have been or will be submitted to other meetings or publications are acceptable, but authors must indicate this information at submission time. If accepted, authors must notify the organizers as to whether the paper will be presented at the workshop or elsewhere.
Papers should be submitted electronically at https://www.softconf.com/acl2018/CALCS.