A study of Sentiment, Topic Distribution, and Use of Discourse Vocabulary Across Political Communities
View the Project on GitHub BornaKalhor/reddits-political-temper-check
Online discourse in general, and political discussion in particular, has expanded rapidly in both size and significance and now holds greater sway in shaping the public’s thoughts and opinions. Reddit, as one of the most popular discussion platforms, provides a unique window into how different political communities engage in discourse. Understanding modern democratic dialogue depends on understanding how political communication differs across ideological lines.
To this end, we conducted a study of the main topics of discourse, the sentiment, and the prevalence of evidence-based vocabulary and discourse markers in political discussions on Reddit.
Specifically, we aimed to:
The analysis used the corpus-webis-tldr-17.json dataset. Due to hardware constraints, only the first 300,000 entries were studied, which included Reddit comments across various subreddits. Filtering by a list of political subreddits yielded a final dataset of 8,781 comments from politically oriented discussions.
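A minimal sketch of this sampling and filtering step, assuming the file stores one JSON object per line and that each record carries a `subreddit` field. The subreddit set below is a hypothetical placeholder, not the list used in the study, and the function name is illustrative.

```python
import json
from itertools import islice

# Hypothetical placeholder; the study's actual list of political subreddits is not shown here.
POLITICAL_SUBREDDITS = {"politics", "Libertarian", "Conservative"}


def load_political_comments(lines, allowed=POLITICAL_SUBREDDITS, limit=300_000):
    """Parse at most `limit` JSON-lines records and keep those from allowed subreddits."""
    kept = []
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: the community name is stored under the "subreddit" key.
        if record.get("subreddit") in allowed:
            kept.append(record)
    return kept


# In-memory sample; for the real file, pass an open file handle instead:
#   with open("corpus-webis-tldr-17.json", encoding="utf-8") as fh:
#       comments = load_political_comments(fh)
sample = [
    '{"subreddit": "politics", "content": "Tax policy matters."}',
    '{"subreddit": "aww", "content": "Cute dog."}',
]
comments = load_political_comments(sample)
```

Reading the file lazily and stopping after `limit` lines keeps memory use bounded, which fits the hardware constraint mentioned above.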
Data preprocessing consisted of several steps. First, unnecessary information such as URLs, Reddit mentions, and punctuation was removed. Next, using WordNet’s lemmatizer, we reduced words to their base forms for consistency and standardization. Finally, we applied a minimum length requirement so that only meaningful text entries were kept.
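The cleaning steps could look roughly like the sketch below. The regular expressions, the `min_tokens` threshold, and the function name are assumptions made for illustration; the study's exact patterns and length cutoff are not stated. The lemmatizer is passed in as a callable so the block runs without extra downloads; NLTK's `WordNetLemmatizer().lemmatize` would be one candidate (it needs the WordNet corpus installed).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Matches Reddit mentions such as /u/name, u/name, /r/sub, r/sub.
MENTION_RE = re.compile(r"/?\b[ur]/[A-Za-z0-9_-]+")
# Anything that is not a word character or whitespace, plus underscores.
PUNCT_RE = re.compile(r"[^\w\s]|_")


def clean_comment(text, lemmatize=lambda word: word, min_tokens=3):
    """Return the cleaned text, or None if it is shorter than `min_tokens` tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # remove URLs first, since they may contain /r/...
    text = MENTION_RE.sub(" ", text)  # then user and subreddit mentions
    text = PUNCT_RE.sub(" ", text)    # then punctuation
    tokens = [lemmatize(token) for token in text.split()]
    return " ".join(tokens) if len(tokens) >= min_tokens else None


cleaned = clean_comment("Check https://example.com, /u/someone said taxes ROSE!")
```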
To find the main topics of discussion, we performed topic modeling. First, the text had to be converted to vectors; we used TF-IDF vectorization with a maximum of 1,500 features, for computational efficiency, and a minimum document frequency of 3, so that only words with meaningful representation were kept. Although LDA was used to calculate perplexity scores as a supporting metric, we opted for NMF both for finding the optimal number of topics (determined by cohesion score) and for the actual modeling, because it potentially performs better than LDA on shorter texts and is simpler and computationally cheaper than more advanced methods such as BERTopic.
Drawing on the topics discovered by topic modeling, we defined eight categories (Left, Right, Center, Election, Geopolitics, Job and Finance, Law and Crime, and Other) and assigned one of them to each entry in the dataset based on its subreddit.
Note that although the topic analysis was based on the text content, we classified entries into these categories by subreddit rather than by text. This is because we were interested in studying sentiment and language in different categories of political subreddits, not different categories of political content across all subreddits. We took this indirect approach because clustering and labeling subreddits directly from their names was almost impossible: the names are short and often seemingly meaningless. In effect, topic modeling not only helped us find the main topics of discussion but also helped classify over 600 subreddits into a few main categories.
We employed VADER (Valence Aware Dictionary and sEntiment Reasoner) to compute sentiment scores, which produced compound scores ranging from -1 (most negative) to +1 (most positive). VADER is a very popular lexicon- and rule-based sentiment analyzer. We favored this tool in spite of the advent of more recent and powerful transformer-based models, as VADER was specifically designed and tuned for analyzing sentiment in social media text, with the bonus of being much faster and much less hardware-demanding.
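A small sketch of how compound scores might be attached to comments. The scorer is passed in as a callable so the example does not depend on a particular VADER distribution; the `text` key, the function name, and the stand-in scorer are assumptions for illustration only.

```python
def add_compound_scores(comments, polarity_scores):
    """Return copies of the comment dicts with a 'compound' score added.

    `polarity_scores` is any callable that maps a string to a dict containing
    a "compound" key, which is the shape returned by VADER's analyzer.
    """
    scored = []
    for comment in comments:
        compound = polarity_scores(comment["text"])["compound"]
        scored.append({**comment, "compound": compound})
    return scored


# With the real analyzer (requires `pip install vaderSentiment`):
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   scored = add_compound_scores(comments, SentimentIntensityAnalyzer().polarity_scores)

# Stand-in scorer used only to make this example self-contained; it is NOT VADER.
def fake_scorer(text):
    return {"compound": 0.5 if "good" in text else -0.5}


scored = add_compound_scores(
    [{"text": "a good policy"}, {"text": "a terrible policy"}],
    fake_scorer,
)
```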
A comprehensive keyword analysis was conducted using 197 keywords across eight categories:
We measured both the average keyword count and the keyword density of the comments in each category.
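One way these two measures could be computed is sketched below. It assumes density is the number of keyword hits divided by the number of words in a comment, averaged per category; the keyword set is a hypothetical excerpt, since the study's 197 keywords are not reproduced here.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical excerpt; not the study's actual keyword list.
KEYWORDS = {"evidence", "study", "data", "however", "therefore", "according to"}


def keyword_stats(text, keywords=KEYWORDS):
    """Return (keyword count, keyword density) for one comment."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    count = sum(
        len(re.findall(rf"\b{re.escape(keyword)}\b", lowered)) for keyword in keywords
    )
    density = count / len(words) if words else 0.0
    return count, density


def average_by_category(rows):
    """rows: iterable of (category, text). Returns {category: (avg count, avg density)}."""
    grouped = defaultdict(list)
    for category, text in rows:
        grouped[category].append(keyword_stats(text))
    return {
        category: (mean(c for c, _ in stats), mean(d for _, d in stats))
        for category, stats in grouped.items()
    }


count, density = keyword_stats("However, the data is clear. According to one study, it works.")
averages = average_by_category([("A", "the data"), ("A", "no keywords here")])
```

Averaging per-comment densities is not the same as dividing the average keyword count by the average word count, because short comments weigh more heavily in the former.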
Significant differences in sentiment were observed across political categories:
Political Category | Mean Sentiment | Standard Deviation | Sample Size |
---|---|---|---|
Job and Finance | 0.1968 | 0.7293 | 454 |
Election | 0.1610 | 0.7178 | 107 |
Center | 0.1485 | 0.8368 | 107 |
Law and Crime | 0.1099 | 0.7879 | 107 |
Other | 0.0864 | 0.7802 | 714 |
Geopolitics | 0.0488 | 0.7791 | 469 |
Right | 0.0411 | 0.7912 | 961 |
Left | 0.0322 | 0.7648 | 5,862 |
An ANOVA confirmed that these differences are statistically significant (F=3.7196, p=0.0005).
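As a rough cross-check, the F statistic of a one-way ANOVA can be recomputed from the summary statistics in the table above. This sketch assumes the reported standard deviations are sample standard deviations and that the test was a standard one-way ANOVA across the eight categories; the function name is illustrative.

```python
# (sample size, mean sentiment, standard deviation) per category, copied from the table.
GROUPS = {
    "Job and Finance": (454, 0.1968, 0.7293),
    "Election": (107, 0.1610, 0.7178),
    "Center": (107, 0.1485, 0.8368),
    "Law and Crime": (107, 0.1099, 0.7879),
    "Other": (714, 0.0864, 0.7802),
    "Geopolitics": (469, 0.0488, 0.7791),
    "Right": (961, 0.0411, 0.7912),
    "Left": (5862, 0.0322, 0.7648),
}


def anova_f_from_summary(groups):
    """One-way ANOVA F statistic computed from per-group (n, mean, sd)."""
    stats = list(groups.values())
    total_n = sum(n for n, _, _ in stats)
    grand_mean = sum(n * m for n, m, _ in stats) / total_n
    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats)
    # Within-group variation, recovered from the sample standard deviations.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in stats)
    df_between = len(stats) - 1
    df_within = total_n - len(stats)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = anova_f_from_summary(GROUPS)
print(f"F({df_between}, {df_within}) = {f_stat:.4f}")
```

The result should land close to the reported F value, with small differences expected from rounding in the table. Given SciPy, the matching p-value could be obtained with `scipy.stats.f.sf(f_stat, df_between, df_within)`.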
Eight distinct topics were identified through NMF modeling:
Political Category | Avg Content Length | Avg Word Count | Avg Keyword Count | Keyword Density |
---|---|---|---|---|
Center | 1,683 | 285 | 10.98 | 0.0428 |
Law and Crime | 1,482 | 258 | 9.09 | 0.0425 |
Other | 1,456 | 251 | 8.28 | 0.0387 |
Right | 1,325 | 227 | 7.94 | 0.0405 |
Job and Finance | 1,317 | 225 | 7.92 | 0.0407 |
Geopolitics | 1,281 | 219 | 7.63 | 0.0398 |
Left | 1,217 | 210 | 7.33 | 0.0400 |
Election | 1,014 | 178 | 6.08 | 0.0371 |
Notable correlations:
These do not indicate any causality.
The analysis reveals that discussions in Job and Finance, Election, and Center subreddits exhibit the most positive sentiment, while Left and Right subreddits show the lowest average sentiment. One possible explanation is that economic discussions are more solution-oriented, while ideological discussions are more critical or negative in tone.
The identification of eight distinct topics demonstrates the span of political discourse on Reddit: from specific policy areas (taxation, gun policy, education) to broader institutional discussions (legal/constitutional, electoral politics). The prominence of economic and labor-related topics reflects their central role in contemporary online political discourse.
The Center and Law and Crime categories show the highest discourse vocabulary metrics, with longer content and higher keyword density. This suggests that these discussions typically require more nuanced analysis (centrist positions, legal matters) and may naturally employ more sophisticated discourse patterns. Interestingly, while discussions in election-related subreddits have the second-highest mean sentiment (more positive in tone), they rank lowest in use of evidence-based vocabulary. Moreover, the lack of a significant correlation between keyword density and sentiment suggests that more critical comments on Reddit do not use more evidence-based vocabulary.
This computational analysis provides empirical evidence for systematic differences in political discourse across ideological orientations on Reddit.
Key findings:
This study showed how communication patterns in online environments vary with political affiliation, providing important considerations for research into the quality of civic discourse and ideological polarization. We also release the resulting dataset, with the added dimensions, for other researchers to build on.
This report is still under development!