SemEval 2020 Task 9
SentiMix: Sentiment Analysis for Code-Mixed Social Media Text

Important Dates

* All deadlines are calculated at 11:59 pm UTC-12.

Trial Data Ready	~~Jul 31 (Wed), 2019~~
Training Data Ready	~~Sep 4 (Wed), 2019~~
Test Data Ready	~~Feb 19 (Wed), 2020~~
Evaluation Start	~~Feb 19 (Wed), 2020~~
Evaluation End	~~Mar 11 (Wed), 2020~~
Results Posted	~~Mar 18 (Wed), 2020~~
System Description Paper Submission Due	~~May 1 (Fri), 2020~~
Task Description Paper Submission Due	~~May 8 (Fri), 2020~~
Notification to Authors	~~Jun 24 (Wed), 2020~~
Camera-ready Due	Jul 8 (Wed), 2020
Workshop	12-13 December 2020

* The Overview paper is now available at arXiv.

* Results are posted. Check the following links.

Hindi-English (Hinglish)

Spanish-English (Spanglish)

* Dear participants, please use the following BibTeX entry to cite the overview paper.

@inproceedings{patwa2020sentimix,
  title={SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets},
  author={Patwa, Parth and
          Aguilar, Gustavo and
          Kar, Sudipta and
          Pandey, Suraj and
          PYKL, Srinivas and
          Gamb{\"a}ck, Bj{\"o}rn and
          Chakraborty, Tanmoy and
          Solorio, Thamar and  
          Das, Amitava},
  booktitle = {Proceedings of the 14th International Workshop on Semantic Evaluation ({S}em{E}val-2020)},
  year = {2020},
  month = {December},
  address = {Barcelona, Spain},
  publisher = {Association for Computational Linguistics}
}

Welcome

Mixing languages, also known as code-mixing, is the norm in multilingual societies. Multilingual people who are non-native English speakers tend to code-mix using English-based phonetic typing and the insertion of anglicisms into their main language. In addition to mixing languages at the sentence level, it is fairly common to find code-mixing at the word level. This linguistic phenomenon poses a great challenge to conventional NLP systems, which currently rely on monolingual resources to handle the combination of multiple languages. The objective of this proposal is to draw the attention of the research community to the task of sentiment analysis in code-mixed social media text. Specifically, we focus on the combination of English with Spanish (Spanglish) and with Hindi (Hinglish), which are the 3rd and 4th most spoken languages in the world, respectively.

Hinglish and Spanglish - the Modern Urban Languages

The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), and chats (e.g., WhatsApp and Facebook messages) has created many new opportunities for information access and language technology, but it has also posed many new challenges, making it one of the current prime research areas. Although current language technologies are primarily built for English, non-native English speakers combine English and other languages when they use social media. In fact, statistics show that half of the messages on Twitter are in a language other than English. This evidence suggests that other languages, as well as multilinguality and code-mixing, need to be considered by the NLP community. Code-mixing poses several previously unseen difficulties for NLP tasks such as word-level language identification, part-of-speech tagging, dependency parsing, machine translation, and semantic processing. Conventional NLP systems rely heavily on monolingual resources to address code-mixed text, which limits their ability to handle issues such as English-based phonetic typing and word-level code-mixing. The next two phrases are examples of code-mixing in Spanglish and Hinglish. In the Spanglish example, in addition to the code-mixing at the sentence level, the word "pushes" conjugates the English word "push" according to the grammar rules of Spanish, which shows that code-mixing can also happen at the word level. In the Hinglish example, only one English word, "enjoy", is used; more noticeably, the Hindi words are written with English phonetic typing instead of Devanagari script, which is a popular practice in India.

The SentiMix task - A summary

The task is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed language pairs are English-Hindi and English-Spanish. Besides the sentiment labels, we will also provide language labels at the word level. The word-level language tags are en (English), spa (Spanish), hi (Hindi), mixed, and univ (e.g., symbols, @ mentions, hashtags). System performance will be measured in terms of Precision, Recall, and F-measure.
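
As a rough illustration of the evaluation described above, the following Python sketch computes per-class Precision, Recall, and F-measure over the three sentiment labels. It is not the official scoring script. The function names `per_class_scores` and `macro_f1` and the toy label lists are hypothetical, and macro-averaging is an assumption, since the summary above does not say how the per-class scores are combined.

```python
from collections import Counter

# The three sentiment labels named in the task summary.
LABELS = ("positive", "negative", "neutral")


def per_class_scores(gold, predicted, labels=LABELS):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    scores = {}
    for label in labels:
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy data, invented for illustration only.
gold = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "neutral", "neutral", "negative"]

scores = per_class_scores(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

Classes with no predictions or no gold instances are scored 0.0 here to avoid division by zero; an official scorer may treat that case differently.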
