Feature Extraction Of Protest Demonstration On Lihkg Discussion Forum

Ao Shen: Hi, I’m [inaudible] Ao Shen from the University of Hong Kong. This research was done by Dr. K P Chow and me, and our title is Feature Extraction of Protest Demonstration on the Lihkg Discussion Forum. Here is the agenda of this presentation.

We will first give an overview of the Lihkg data set, then talk about the motivation for this research. Next is the experiment on multi-label classification. Finally, we’ll have a conclusion and talk about future work.

In today’s cyberspace, many cyber crimes leave traces on social media platforms such as Facebook and online discussion forums, and potential criminal activities may be posted on social media. Online platforms are therefore places for detecting potential crimes and obtaining traces. The Lihkg discussion forum is a well-known multi-category discussion forum in Hong Kong. Some activity organizers post notification messages in advance, and their supporters or regular users can immediately read, participate, and post their comments.

From August 2019, most of the discussions on the forum refer to public gatherings and demonstrations. The demonstrations in Hong Kong over these past months have astonished the world. Whether a particular activity constitutes a criminal offense depends on the local laws and regulations, but it is still important for police resource management to identify the features of an unregistered demonstration. So this study established a model to automatically identify the features of demonstrations from public posts.

We divided our experiment into four parts: data collection and labeling, post vectorization, the multi-label classification model, and finally the experiment results. The first step is data collection. We collected the posts in the Current Affairs section from August to October 2019. Most of the posts are in traditional Chinese or Cantonese, and the sentences in brackets are English translations for this presentation.

About data labeling: in total there are seven labels, namely strikes, arson, wounding, sit-ins, parades, riots, and conflict. We randomly selected 1,272 posts and labeled the data as shown in the table here.
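As a minimal sketch of what multi-label annotation can look like in practice, the snippet below turns the labels attached to each post into a 0/1 indicator vector, one position per label. The label order, the `encode_labels` helper, and the example annotations are assumptions made for illustration; they are not taken from the actual dataset or the authors' code.

```python
# The seven labels named in the talk; the order here is an arbitrary choice.
LABELS = ["strikes", "arson", "wounding", "sit-ins", "parades", "riots", "conflict"]


def encode_labels(post_labels):
    """Turn the set of labels attached to one post into a 0/1 indicator vector."""
    unknown = set(post_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in post_labels else 0 for name in LABELS]


# Hypothetical annotations, not real posts from the forum.
annotated_posts = [
    {"parades"},
    {"arson", "riots", "conflict"},
    set(),  # a post that matches none of the seven labels
]

# One row per post, one column per label.
label_matrix = [encode_labels(labels) for labels in annotated_posts]
for row in label_matrix:
    print(row)
```

Because a post can carry several labels at once, or none, each row may contain any number of ones. This is what distinguishes the multi-label setting from ordinary single-label classification.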

The second part is post vectorization. We first use a Doc2Vec model to convert the textual posts into a vector matrix, and then use Pearson’s correlation coefficient to calculate a coefficient matrix that captures the relationship between the posts and the seven labels. This matrix is the input of the multi-label classification model. For the classification model, we use an MLP neural network, which is a class of feedforward artificial neural network, and I show the structure and the information of the model here. The accuracy, precision, recall, and F1 score are shown in this slide. From the weighted average values here, we can see that the MLP performs well and reliably classifies posts into the corresponding labels.
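The talk does not spell out exactly how the correlation matrix is built, so the following is only one possible reading: compute Pearson's r between each dimension of the post vectors and each label column. The helper names, the toy vectors, and the choice to return 0.0 for a constant column are all assumptions made for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson's r between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # r is undefined for a constant column; 0.0 is a convention chosen here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(vectors, labels):
    """Rows: vector dimensions. Columns: labels. Entries: Pearson's r."""
    dims = list(zip(*vectors))       # one tuple per vector dimension
    label_cols = list(zip(*labels))  # one tuple per label
    return [[pearson(d, col) for col in label_cols] for d in dims]


# Toy data: four posts, two vector dimensions, two labels (all hypothetical).
post_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
post_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]

corr = correlation_matrix(post_vectors, post_labels)
for row in corr:
    print([round(r, 3) for r in row])
```

In this toy data the first dimension is high exactly when the first label is set, so its correlation with that label is strongly positive and its correlation with the second label is strongly negative.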

I think the labeled data is unbalanced, so it’s still difficult to identify some minority features. We can also output some highly correlated topics that appear in each class. Based on these topics, it is feasible to describe the subject and understand the urgency of the corresponding demonstration.
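To illustrate why unbalanced labels matter when reading a weighted average, here is a small sketch that computes per-label precision, recall, and F1, and then compares the macro average with a support-weighted average. It assumes the weighted average in the talk is weighted by the number of true examples per label, as is common in standard toolkits; the function names and the toy predictions are hypothetical.

```python
def prf(y_true, y_pred):
    """Precision, recall, F1 and support for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1, tp + fn  # support = number of true positives in the data


def summarize(true_matrix, pred_matrix, names):
    """Per-label scores plus macro and support-weighted F1."""
    per_label = {}
    for j, name in enumerate(names):
        col_true = [row[j] for row in true_matrix]
        col_pred = [row[j] for row in pred_matrix]
        per_label[name] = prf(col_true, col_pred)
    total_support = sum(s for _, _, _, s in per_label.values())
    macro_f1 = sum(f for _, _, f, _ in per_label.values()) / len(per_label)
    if total_support:
        weighted_f1 = sum(f * s for _, _, f, s in per_label.values()) / total_support
    else:
        weighted_f1 = 0.0
    return per_label, macro_f1, weighted_f1


# Toy data: a majority label predicted perfectly, a minority label always missed.
names = ["parades", "arson"]
y_true = [[1, 0]] * 8 + [[0, 1]] * 2
y_pred = [[1, 0]] * 8 + [[0, 0]] * 2

per_label, macro_f1, weighted_f1 = summarize(y_true, y_pred, names)
print(per_label)
print("macro F1:", macro_f1, "weighted F1:", weighted_f1)
```

In this toy case the minority label has an F1 of zero, yet the weighted average is still much higher than the macro average, because the majority label dominates the weighting. Looking at per-label scores next to the averages is one way to spot the minority features that remain hard to identify.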

To conclude, the Floyd demo demonstration will not only disrupt business and traffic, but also affect social security and threaten the social order. We use an MLP multi-label classification method to understand the specific forms and characteristics of harmful demonstrations on online forums, so that it can provide theoretical support and applications for police to assign resources and properly monitor the activity schedule. Future research will attempt to study the relationship between the subjects and discourses on a deeper level. That’s the end of this presentation, and thank you for listening.
