Topic and Thematic Analysis of OEWG Statements
This report details the methodology and findings of a thematic and sentiment analysis conducted on the national statements delivered during the sessions of the Open-Ended Working Group (OEWG) on the Security of and in the Use of Information and Communication Technologies (2021–2025). The source of this exercise can be found here.
Data preparation
This exercise uses a dataset of national statements from the OEWG on developments in the field of information and telecommunications in the context of international security.
Raw transcripts were taken directly from the OEWG website and processed into a structured CSV format, where each row represents a single national intervention or statement turn. Non-textual artefacts, such as headers or timestamps, were cleaned prior to analysis.
To preprocess the data, the following steps were taken:
Normalization: convert all text to lowercase and strip punctuation, digits, URLs, and formatting noise.
Whitespace: collapse redundant spaces.
Symbols and stopwords: remove special characters and stopwords, retaining only alphanumeric tokens.
Filtering: keep only statements with non-empty bodies for analysis.
I built a secondary cleaning pipeline that keeps two variants of each text column (body and body_clean), so that later analytical steps can be tested against both the raw and cleaned text.
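The preprocessing steps above can be sketched as follows. This is a minimal, stdlib-only illustration; the function and key names are mine, not the report's actual code, and stopword filtering is omitted for brevity:

```python
import re

def clean_text(text: str) -> str:
    """Normalise a statement body: lowercase, strip URLs, digits,
    punctuation, and redundant whitespace (illustrative sketch)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep alphabetic tokens only
    return re.sub(r"\s+", " ", text).strip()   # collapse redundant spaces

def prepare(bodies):
    """Keep both raw and cleaned variants; drop empty bodies."""
    out = []
    for body in bodies:
        body_clean = clean_text(body)
        if body_clean:                         # filter non-empty bodies
            out.append({"body": body, "body_clean": body_clean})
    return out
```

For example, `prepare(["Visit https://un.org NOW!!"])` keeps the original string in `body` and `"visit now"` in `body_clean`.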
Thematic analysis
To make the thematic analysis both exhaustive and precise, I combined Non-negative Matrix Factorization (NMF) topic modelling with a manual thematic analysis. The unsupervised step sits on top of the manual coding to surface hidden patterns and reduce the bias that can come from oversight. The manual analysis nonetheless remains central, and arguably more important here, given the qualitative, research-based nature of this task.
NMF Modelling Thematic Analysis
Topic extraction was performed using an NMF model on Term Frequency-Inverse Document Frequency (TF-IDF) features. I used scikit-learn's TfidfVectorizer and NMF implementations:
Vectorization parameters:
max_features = 40,000
min_df = 2 (ignore words appearing in fewer than two documents)
max_df = 0.7 (ignore overly common words)
ngram_range = (1, 2) to capture bigrams
Model parameters:
n_components = 12 (topics)
init = 'nndsvd'
max_iter = 600
random_state = 42 for reproducibility
For each topic, the 15 highest-weighted words were extracted to aid interpretability. Each statement was assigned its dominant topic ID and a corresponding topic confidence score. The NMF results were then visualized using a topic-distribution bar chart (fig. 1), showing counts of statements per NMF topic, and word clouds (fig. 2), showing the top keywords per topic.
Based on the word clouds shown in fig. 2, unsupervised discovery proved insufficient in this case. This could be due to a few reasons, the most probable being that the text in this exercise is very homogeneous. As UN text, the transcripts are stylistically uniform, so the algorithm may not have been able to find strong signals.
Fig. 1
Fig. 2
Manual Thematic Analysis (keyword scorer)
To complement the machine-derived topics, a rule-based keyword scoring system was implemented to capture normative and procedural themes discussed by the listed states. Seven thematic rules were predefined:
• Confidence Building
• International Law
• Threat Environment
• Cyber Capacity Building
• Rules, norms and principles of responsible state behaviour in cyberspace
• Stakeholder Involvement
• Dialogue Mechanism
Each theme was associated with a curated keyword list and compiled into case-insensitive regular expressions. Statements were scored by counting keyword matches for each theme. A function selected up to two top-scoring themes per statement.
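The scoring logic can be sketched as below. The keyword lists here are short stand-ins; the report's curated lists are longer, and the names are illustrative:

```python
import re

# Illustrative keyword lists (the actual curated lists are more extensive)
THEME_KEYWORDS = {
    "Confidence Building": ["confidence building", "transparency measures"],
    "International Law": ["international law", "un charter", "sovereignty"],
    "Threat Environment": ["ransomware", "malicious", "critical infrastructure"],
}

# Compile each theme's keywords into one case-insensitive regex
THEME_PATTERNS = {
    theme: re.compile("|".join(map(re.escape, kws)), re.IGNORECASE)
    for theme, kws in THEME_KEYWORDS.items()
}

def top_themes(text, k=2):
    """Count keyword matches per theme; return up to k top-scoring themes."""
    scores = {t: len(p.findall(text)) for t, p in THEME_PATTERNS.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, s in ranked if s > 0][:k]
```

Statements matching no keywords simply receive no theme, which keeps the coding conservative.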
Results were visualized using matplotlib and seaborn, including a heatmap (fig. 3) illustrating the frequency of thematic mentions by country and bar plots (figs. 4 and 5) showing the overall frequency of themes and the top themes by country.
Fig. 3
Fig. 4
Fig. 5
The overall output of the analysis can be accessed here.
Theme correlation analysis
To identify how thematic discussions intersected across national statements, each document was coded for the presence of the seven predefined thematic categories. A binary matrix was constructed to indicate whether each theme appeared in a given statement. Pairwise correlations were then computed using Pearson's coefficient to capture how frequently themes co-occurred within the same text (fig. 6).
Fig. 6
The resulting correlation matrix and heatmap reveal clusters of related themes. For example, there is a strong positive correlation between International Law and Rules/Norms/Principles.
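A minimal sketch of the binary-matrix-plus-Pearson step, assuming pandas; the `pattern_for` helper (mapping a theme name to its compiled regex) is a hypothetical stand-in for the keyword patterns described earlier:

```python
import pandas as pd

def theme_correlations(statements, themes, pattern_for):
    """Build a binary statement x theme matrix, then compute pairwise
    Pearson correlations between themes (illustrative sketch)."""
    matrix = pd.DataFrame(
        [{t: int(bool(pattern_for(t).search(s))) for t in themes}
         for s in statements]
    )
    return matrix.corr(method="pearson")
```

Themes that never vary across the corpus (always present or always absent) yield NaN correlations, so such columns should be inspected before plotting.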
Sentiment Analysis
The purpose of this analysis was to examine the tone and polarity of national statements delivered to the OEWG sessions. Sentiment analysis was used to determine whether the language of each statement reflected positive, neutral, or negative sentiment toward the issues under discussion.
Sentiment classification was performed using the VADER model, implemented through the nltk.sentiment.vader module. VADER is a lexicon- and rule-based model designed to detect positive, negative, and neutral tones in text. It assigns four scores to each text segment:
Neg - proportion of negative tone
Neu - proportion of neutral tone
Pos - proportion of positive tone
Compound - an aggregated score ranging from -1 (most negative) to +1 (most positive)
However, as with the unsupervised thematic analysis, there is very little to take away from this sentiment analysis. Again, this is likely due to the dataset's low lexical diversity: as UN transcripts, the statements follow a standardized register, so the language appears largely uniform.