Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit

Vitiugin, F., Lee, S., Paakki, H., Chizhikova, A., & Sawhney, N. (2024). Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit. arXiv preprint arXiv:2406.08633.

Abstract

The surge in global migration patterns underscores the imperative of integrating migrants seamlessly into host communities, necessitating inclusive and trustworthy public services. Despite the Nordic countries’ robust public sector infrastructure, recent immigrants often encounter barriers to accessing these services, exacerbating social disparities and eroding trust. Addressing digital inequalities and linguistic diversity is paramount in this endeavor. This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions. Leveraging ensemble learning techniques for combining multiple tokenizers’ outputs and pre-trained language models, ELMICT demonstrates high performance (with F1 more than 0.95) in identifying code-mixing across various languages and contexts, particularly in cross-lingual zero-shot conditions (with avg. F1 more than 0.70). Moreover, the utilization of ELMICT helps to analyze the prevalence of code-mixing in migration-related threads compared to other thematic categories on Reddit, shedding light on the topics of concern to migrant communities. Our findings reveal insights into the communicative strategies employed by migrants on social media platforms, offering implications for the development of inclusive digital public services and conversational systems. By addressing the research questions posed in this study, we contribute to the understanding of linguistic diversity in migration discourse and pave the way for more effective tools for building trust in multicultural societies.


More information

Even though the Nordic countries have strong public service systems, many recent immigrants still face challenges when trying to use them. These challenges (such as language barriers or lack of digital access) can lead to feelings of exclusion and mistrust.

This study looks at how migrants communicate online, especially on platforms like Reddit, where people often mix languages in a single message, a practice known as code-mixing. To better understand this, we developed a new tool called ELMICT (Ensemble Learning for Multilingual Identification of Code-mixed Texts). It uses ensemble learning to combine the outputs of several tokenizers and pre-trained language models, and automatically detects when people are mixing languages in their posts.
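To give a rough feel for what "ensemble" detection means, the toy sketch below lets three very simple detectors vote on whether a message mixes languages. It is not ELMICT and does not use tokenizers or pre-trained language models; the stopword lists, the detector functions, and the name `is_code_mixed` are all assumptions made up for this illustration.

```python
import unicodedata
from typing import Callable, List

# Tiny illustrative stopword lists (assumptions, not taken from the paper).
STOPWORDS = {
    "en": {"the", "is", "and", "to", "my", "i", "for", "with"},
    "fi": {"ja", "ei", "että", "minä", "se"},
    "es": {"el", "la", "y", "es", "que", "para"},
}


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,!?;:").lower() for t in text.split()]


def stopword_detector(text: str) -> bool:
    """Fires when stopwords from at least two languages occur."""
    toks = set(tokens(text))
    hits = [lang for lang, words in STOPWORDS.items() if toks & words]
    return len(hits) >= 2


def script_detector(text: str) -> bool:
    """Fires when letters from at least two Unicode scripts occur."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # The first word of the Unicode name is the script, e.g. "LATIN".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return len(scripts) >= 2


def non_ascii_detector(text: str) -> bool:
    """Fires when English stopwords co-occur with non-ASCII tokens."""
    toks = tokens(text)
    has_english = any(t in STOPWORDS["en"] for t in toks)
    has_non_ascii = any(t and not t.isascii() for t in toks)
    return has_english and has_non_ascii


DETECTORS: List[Callable[[str], bool]] = [
    stopword_detector,
    script_detector,
    non_ascii_detector,
]


def is_code_mixed(text: str) -> bool:
    """Majority vote: code-mixed if more than half of the detectors fire."""
    votes = sum(detector(text) for detector in DETECTORS)
    return votes * 2 > len(DETECTORS)


if __name__ == "__main__":
    for msg in [
        "Minä tarvitsen apua with my application että saan luvan",
        "I need help with my residence permit application",
        "Я хочу apply for the visa",
    ]:
        print(is_code_mixed(msg), "|", msg)
```

Each toy detector is easy to fool on its own; the point of the vote is that a message is flagged only when several independent signals agree. A real system would replace these hand-written rules with learned models.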

ELMICT detects code-mixing accurately (F1 above 0.95, and above 0.70 on average in cross-lingual zero-shot conditions). This lets us measure how often code-mixing occurs in conversations about migration compared with other topics on Reddit, and which topics matter most to migrant communities. This knowledge can help governments and organizations design more inclusive digital services that truly meet the needs of diverse communities.
