Stock image of data through a lens

In a recently published chapter of the Oxford Handbook of the Sociology of Machine Learning, DPhil student Linda Hong Cheng proposes novel approaches and applications for decolonial Machine Learning and Natural Language Processing Methods in Chinese-Language Contexts.

The chapter, written by the Leverhulme Centre for Demographic Science’s Affiliate Member Linda Hong Cheng, is the first to review the work and innovations in the emerging field of Chinese computational sociology.

Study figure: Three Intersecting Fields That Form Chinese Computational Sociology
Study figure: Three intersecting fields that
form Chinese computational sociology

Chinese computational sociology is an area of study established by Linda, which she defines as ‘a subfield of sociological research within Chinese studies that applies Machine Learning (ML) and/or Natural Language Procession (NLP) methods to Chinese-language contexts, […] profoundly engaged with sociological theory and focused on understanding and evaluating the workings of Chinese society via […] computational approaches.’

A rise in ML and NLP methods have equipped social scientists with the necessary tools to collect previously inaccessible data on socio-political issues in China. Linda’s review reveals that existing work in Chinese computational sociology primarily focusses on contentious politics surrounding protest or state-society interactions, censorship, and information exchange. Textual and social media platform data are mostly employed as data sources, alongside an ever evolving combination of ML and NLP research around
four types of methodologies:

  1. Supervised learning and dictionary-based NLP methods: This method has predominantly been used to analyse large, unstructured data sources such as protest data on Weibo – one of China’s biggest social media platform – and how they contribute to the themes, mechanisms and media attention of protests within China.
  2. NLP and survey methods: Qualitative methods such as surveys are used to analyse the interactions and response of Chinese citizens to government policies in areas including censorship and public health such as COVID-19.
  3. Multimodal learning: Multiple modes of data such as text, video and image data are used to study the dissemination of protest and political information on social media platforms.
  4. Deep learning: This niche line of research uses deep learning and human annotation to analyse cross-national information dissemination such as information flows from Euro-America into Chinese social media platforms.

Whilst these innovative methods continue to shape ML and NLP in Chinese computational sociology, these methods remain inherently Eurocentric and are particularly challenging to apply to non-European languages. The review points to under-studied areas such as socioeconomic and spatial inequality, and organisational behaviour, as well as methodical limitations across existing literature. In particular, ML applications lack control for censorship surrounding social media data and the fact that NLP methods are not yet completely suitable for Chinese-language texts.

Linda Hong Cheng, Affiliate Member of the Leverhulme Centre for Demographic Science, said ‘The incompatibility of NLP methods with the Chinese language means that sociologists, demographers, and social scientists in general face significant hurdles in accurately analysing and interpreting textual data from Chinese sources.’

The review outlines a number of ways that improvements in ML and NLP technologies could resolve these limitations. Estimating how many posts have been censored based on specific themes, for example, could be used to create a control variable for building more robust ML and prediction models. Artificially adding spaces between Chinese terms, better detecting nuances in the Chinese language, and the expansion of open-source Chinese word banks are proposed for making ML and NLP applications less Eurocentric and more accessible for the Chinese language.

Linda Hong Cheng adds, ‘It is imperative that researchers continue to improve the technology’s effectiveness for analysing Chinese-language data, so that sociologists and social scientists can continue contributing to the increasingly influential field of Chinese computational sociology. At the broader level, NLP and ML methods are profoundly powerful research tools and should not be limited to Euro-American contexts—diversifying their applications are urgently needed for the progression of the social sciences as a whole.

Future trajectories for Chinese computational sociology include introducing computational application classes that include non-Western contexts such as China, addressing the socio-ethical implications of ML and NLP methods, and more research into various sociological themes.

The review concludes, ‘ML and NLP methods applications are located within a rich lineage of international sociological theory and empirical methodology—one that brings new opportunities for the grassroots-ification (普及化 puji hua) of data sources, research in non-Western contexts[…], and the strengthen[ing of] cross-national understanding.’