Code-Switching Patterns in Multilingual Social Media: A Corpus Study
Abstract
We analyze a 4.2-million-token corpus of multilingual Twitter data to characterize intra-sentential code-switching in three language pairs: English-Spanish, English-Arabic, and English-Mandarin. Matrix Language Frame analysis reveals systematic asymmetries in embedded-language selection that correlate with topic domain and audience design. Political and sports topics show the highest switching rates (38% and 31% of tokens, respectively).