ValueMonitor – Create your own topic model¶
This page is a visualisation of the ValueMonitor prototype. To run the notebook yourself, click the ‘Run in Google Colab’ icon below:
1. Import dataset and packages ¶
In this step, the dataset and the relevant Python packages are imported.
''' Packages'''
!pip install corextopic
!pip install joblib
!pip install tabulate
!pip install simple_colors
import os, sys, importlib
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import pickle
import dateutil.parser  # used by the date filters in step 5
''' Source code'''
user = "tristandewildt"
repo = "ValueMonitor_Prototype"
src_dir = "code"
pyfile_1 = "make_topic_model.py"
pyfile_2 = "create_visualisation.py"
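# Remove any previous copy of the repository before cloning a fresh one
if os.path.isdir(repo):
    !rm -rf {repo}
!git clone https://github.com/{user}/{repo}.git
path = f"{repo}/{src_dir}"
if path not in sys.path:
    sys.path.insert(1, path)
# os.path.splitext drops the ".py" suffix; str.rstrip(".py") would strip any trailing ".", "p" or "y" characters
make_topic_model = importlib.import_module(os.path.splitext(pyfile_1)[0])
create_visualisation = importlib.import_module(os.path.splitext(pyfile_2)[0])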
from make_topic_model import *
from create_visualisation import *
''' Datasets'''
!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=12ZyryF8MbMYKuhIBEhUUvnvx43_cna56' -O dataset_ValueMonitor_prototype
with open('dataset_ValueMonitor_prototype', "rb") as fh:
    df = pickle.load(fh)
2. Creating the topic model ¶
In this step, we create a topic model in which some of the topics refer to values. The creation of topics that reflect values is done by means of so-called ‘anchor’ words. These words guide the algorithm in the creation of topics that reflect values.
Anchor words are typically words that people use to refer to (the idea of) a value, such as synonyms. After adding some anchor words and running the model, the algorithm will automatically pick up other words that refer to the value. This is because the algorithm has observed that these words are often mentioned in the same documents as the anchor words.
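The intuition can be illustrated with a toy co-occurrence count. This is only a sketch of the idea, not the algorithm itself: the actual model is fitted by `make_anchored_topic_model`, whose internals are not shown on this page. The helper `cooccurring_words` and the mini-corpus are hypothetical, and the sketch only handles single-word anchors.

```python
from collections import Counter

def cooccurring_words(documents, anchor_words, top_n=3):
    """Return the words that most often share a document with an anchor word."""
    anchors = set(anchor_words)
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        if tokens & anchors:
            # Count each non-anchor word once per document that contains an anchor.
            counts.update(tokens - anchors)
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical mini-corpus, not taken from the ValueMonitor dataset.
toy_documents = [
    "privacy surveillance cameras",
    "privacy surveillance data",
    "confidentiality surveillance data",
    "renewable energy storage",
]

print(cooccurring_words(toy_documents, ["privacy", "confidentiality"], top_n=2))
```

In this toy corpus, "surveillance" and "data" surface as candidates for the Privacy topic even though neither was given as an anchor word.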
Finding the right anchor words is typically an iterative process in which you inspect each new topic model created by the algorithm. Some anchor words need to be added so that no aspect of the value is left out (add them to dict_anchor_words in the cell below). Other words need to be removed because they do not refer to the value (add them to list_rejected_words in the cell below).
We have prefilled a number of anchor words for each value.
dict_anchor_words = {
    "Justice and Fairness" : ["justice", "fairness", "fair", "equality", "unfair"],
    "Privacy" : ["privacy", "personal data", "personal sphere", "data privacy", "privacy protection", "privacy concerns",
                 "confidentiality"],
    "Cyber-security" : ["cyber", "security", "cybersecurity", "malicious", "attacks"],
    "Environmnental Sustainability" : ["sustainability", "sustainable", "renewable", "durable", "durability",
                                       "sustainable development", "environmental"],
    "Transparency" : ["transparency", "transparent", "transparently", "explainability", "interpretability", "explainable",
                      "opaque", "interpretable"],
    # the duplicated "accountable" entry has been removed
    "Accountability" : ["accountable", "accountability", "traceability", "traceable"],
    "Autonomy" : ["autonomy", "self-determination", "autonomy human", "personal autonomy"],
    "Democracy" : ["democracy", "democratic", "human rights", "freedom speech", "equal representation",
                   "political"],
    "Reliability" : ["reliability", "reliable", "robustness", "robust", "predictability"],
    "Trust" : ["trust", "trustworthy", "trustworthiness", "confidence", "honesty"],
    "Well-being" : ["well being", "well-being", "wellbeing", "quality life",
                    "good life", "qol", "life satisfaction", "welfare"],
    "Inclusiveness" : ["inclusiveness", "inclusive", "inclusivity", "discrimination", "diversity"]
}
list_rejected_words = ["iop", "iop publishing", "publishing ltd", "publishing", "licence iop",
                       "mdpi basel", "basel switzerland", "mdpi", "basel", "licensee mdpi", "licensee", "authors licensee",
                       "switzerland", "authors", "publishing limited", "emerald", "emerald publishing"]
list_anchor_words_other_topics = [
    ["internet of things", "iot", "internet things", "iot devices", "things iot"],
    ["artificial intelligence", "ai", "artificial"],
]
number_of_topics_to_find = 100
number_of_documents_in_analysis = 2000
number_of_words_per_topic_to_show = 10
number_of_words_per_topic = 10
'''--------------------------------------------------------------------------'''
model_and_vectorized_data = make_anchored_topic_model(df, number_of_topics_to_find, min(number_of_documents_in_analysis, len(df)), dict_anchor_words, list_anchor_words_other_topics, list_rejected_words)
topics = report_topics(model_and_vectorized_data[0], dict_anchor_words,number_of_words_per_topic)
df_with_topics = create_df_with_topics(df, model_and_vectorized_data[0], model_and_vectorized_data[1], number_of_topics_to_find)
Topic #0 (Justice and Fairness): justice, fair, equality, fairness, unfair, justice department, criminal justice, unequal, status quo, quo
Topic #1 (Privacy): privacy, data privacy, personal data, privacy concerns, privacy data, confidentiality, privacy protection, user privacy, privacy preserving, privacy security
Topic #2 (Cyber-security): security, attacks, cyber, cybersecurity, malicious, national security, threats, hackers, cyber physical, secure
Topic #3 (Environmnental Sustainability): environmental, sustainable, sustainability, angeles, los angeles, renewable, los, sustainable development, durable, renewable energy
Topic #4 (Transparency): transparency, transparent, explainable, opaque, interpretable, explainable artificial, explainability, interpretability, black box, lack transparency
Topic #5 (Accountability): accountability, accountable, texas, federal communications, communications commission, austin, fcc, mitchell, university texas, austin texas
Topic #6 (Autonomy): decision making, autonomy, human beings, making, decision, beings, making process, ai decision, autonomous systems, intelligence decision
Topic #7 (Democracy): political, democratic, democracy, elections, human rights, voting, politics, politicians, citizens, independent
Topic #8 (Reliability): life, nothing, talk, predictable, eyes, door, guys, hope, bar, narrative
Topic #9 (Trust): trust, confidence, trustworthy, trustworthiness, yorker, new yorker, honesty, charitable, public trust, multimillion
Topic #10 (Well-being): welfare, well being, quality life, wellbeing, being, states economy, beijing trade, trolley, time need, skyscrapers
Topic #11 (Inclusiveness): diversity, discrimination, inclusive, lack diversity, exclusion, buttigieg, inclusivity, diversity inclusion, mitch, mcconnell
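When iterating on anchor words, a quick way to spot topics that may have drifted is to check how many of a topic's top words are anchor words. The sketch below is a rough heuristic and not part of ValueMonitor: `anchor_overlap` is a hypothetical helper, and the word lists are copied from the output above and from dict_anchor_words.

```python
def anchor_overlap(top_words, anchor_words):
    """Return the fraction of a topic's top words that are anchor words."""
    if not top_words:
        return 0.0
    anchors = set(anchor_words)
    hits = sum(1 for word in top_words if word in anchors)
    return hits / len(top_words)

# Top words copied from the output above; anchors copied from dict_anchor_words.
privacy_top = ["privacy", "data privacy", "personal data", "privacy concerns",
               "privacy data", "confidentiality", "privacy protection",
               "user privacy", "privacy preserving", "privacy security"]
privacy_anchors = ["privacy", "personal data", "personal sphere", "data privacy",
                   "privacy protection", "privacy concerns", "confidentiality"]

reliability_top = ["life", "nothing", "talk", "predictable", "eyes",
                   "door", "guys", "hope", "bar", "narrative"]
reliability_anchors = ["reliability", "reliable", "robustness", "robust",
                       "predictability"]

print("Privacy:", anchor_overlap(privacy_top, privacy_anchors))
print("Reliability:", anchor_overlap(reliability_top, reliability_anchors))
```

A topic such as Reliability, whose top words contain none of its anchors, is a candidate for revised anchor words. A high overlap does not prove that a topic is good, so sampling documents remains necessary.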
3. Verifying the topic model ¶
To verify whether the topics sufficiently refer to values, the code below prints a sample of documents for a selected value so that you can check whether they indeed address that value.
def plot_print_sample_articles_topic(selected_value, size_sample):
    show_extracts = True # True, False
    show_full_text = False # True, False
    print_sample_articles_topic(df_with_topics, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], size_sample=(5, 50, 5))
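How `print_sample_articles_topic` builds its extracts is not shown on this page. As an illustration of what reading extracts can look like, the sketch below pulls the text surrounding each keyword occurrence. The helper `extract_snippets`, its `window` parameter and the sample sentence are hypothetical.

```python
import re

def extract_snippets(text, keywords, window=40):
    """Return the text fragments surrounding each keyword occurrence."""
    snippets = []
    for keyword in keywords:
        # Match the keyword as a whole word or phrase, ignoring case.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(match.start() - window, 0)
            end = min(match.end() + window, len(text))
            snippets.append(text[start:end].strip())
    return snippets

# Hypothetical sentence, not taken from the ValueMonitor dataset.
sample_text = ("Smart meters raise privacy concerns because usage data can "
               "reveal when residents are at home.")

for snippet in extract_snippets(sample_text, ["privacy"], window=20):
    print("...", snippet, "...")
```

Reading a keyword in context makes it easier to judge whether a document discusses the value or merely mentions the word.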
4. Gap assessment ¶
It takes time before a good topic model is built in which topics adequately represent values. The code in the next cell can be used to import an existing topic model.
def plot_values_in_different_datasets(Selected_technology):
    values_in_different_datasets(df_with_topics, Selected_technology, dict_anchor_words)

interact(plot_values_in_different_datasets, Selected_technology=["AI", "IoT"])
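One plausible reading of a gap, and it is an assumption here rather than a definition given on this page, is a value that one dataset discusses far more often than another. The sketch below computes the share of documents per dataset that address a value. The helper `value_share_per_dataset` and the record layout are hypothetical; the real df_with_topics has its own column layout.

```python
from collections import defaultdict

def value_share_per_dataset(records, value):
    """Return {dataset: percentage of its documents that address `value`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["dataset"]] += 1
        if value in record["values"]:
            hits[record["dataset"]] += 1
    return {name: 100 * hits[name] / totals[name] for name in totals}

# Hypothetical records, not taken from the ValueMonitor dataset.
toy_records = [
    {"dataset": "ETHICS", "values": {"Privacy", "Autonomy"}},
    {"dataset": "ETHICS", "values": {"Privacy"}},
    {"dataset": "TECH", "values": {"Reliability"}},
    {"dataset": "TECH", "values": {"Privacy"}},
    {"dataset": "TECH", "values": set()},
    {"dataset": "TECH", "values": {"Reliability"}},
]

shares = value_share_per_dataset(toy_records, "Privacy")
print(shares)
```

Expressing the result as a share of each dataset, rather than a raw count, keeps datasets of different sizes comparable.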
def plot_print_sample_articles_topic(selected_technology, selected_value, selected_dataset, size_sample):
    show_extracts = True # True, False
    show_full_text = False # True, False
    df_with_topics_selected_technology = df_with_topics[df_with_topics[selected_technology] == True]
    df_with_topics_selected_technology_dataset = df_with_topics_selected_technology[df_with_topics_selected_technology['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset=["TECH", "NEWS", "ETHICS"], size_sample=(5, 50, 5), selected_technology=["AI", "IoT"])
5. Impact assessment ¶
The occurrence of values can be traced over time.
def plot_create_vis_values_over_time(selected_technology, selected_dataset, resampling, smoothing, max_value_y):
    T0 = "1980-01-01" #YYYY-MM-DD
    T1 = "2023-01-01" #YYYY-MM-DD
    values_to_include_in_visualisation = []
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    resampling = resampling_dict[resampling]
    df_with_topics_short = df_with_topics.loc[(df_with_topics['date'] >= dateutil.parser.parse(T0)) & (df_with_topics['date'] <= dateutil.parser.parse(T1))]
    df_with_topics_selected_technology = df_with_topics_short[df_with_topics_short[selected_technology] == True]
    df_with_topics_selected_technology_dataset = df_with_topics_selected_technology[df_with_topics_selected_technology['dataset'] == selected_dataset]
    create_vis_values_over_time(df_with_topics_selected_technology_dataset, dict_anchor_words, resampling, values_to_include_in_visualisation, smoothing, max_value_y)

interact(plot_create_vis_values_over_time, selected_technology=["AI", "IoT"], selected_dataset=["TECH", "NEWS", "ETHICS"], smoothing=(0.25, 3, 0.25), max_value_y=(5, 100, 5), resampling=["Year", "Month", "Day"])
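The resampling and smoothing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation of `create_vis_values_over_time`: it assumes that occurrence is expressed as the share of documents per year, and it swaps in a simple trailing moving average for whatever smoothing the library applies. The helpers `yearly_share` and `moving_average` and the records are hypothetical.

```python
from collections import Counter
from datetime import date

def yearly_share(records, value):
    """Return {year: percentage of that year's documents addressing `value`}."""
    totals = Counter()
    hits = Counter()
    for record in records:
        year = record["date"].year
        totals[year] += 1
        if value in record["values"]:
            hits[year] += 1
    return {year: 100 * hits[year] / totals[year] for year in sorted(totals)}

def moving_average(series, window=3):
    """Smooth a list of numbers with a trailing moving average."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical records, not taken from the ValueMonitor dataset.
toy_records = [
    {"date": date(2019, 3, 1), "values": {"Privacy"}},
    {"date": date(2019, 9, 1), "values": set()},
    {"date": date(2020, 5, 1), "values": {"Privacy"}},
    {"date": date(2021, 1, 15), "values": set()},
    {"date": date(2021, 6, 30), "values": set()},
]

shares = yearly_share(toy_records, "Privacy")
print(shares)
print(moving_average(list(shares.values()), window=2))
```

Dividing by the number of documents per year prevents growth in the size of the corpus from being mistaken for growing attention to a value.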
def plot_print_sample_articles_topic(selected_value, size_sample):
    T0 = "1960-01-01" #YYYY-MM-DD
    T1 = "2023-01-01" #YYYY-MM-DD
    show_extracts = True # True, False
    show_full_text = False # True, False
    df_with_topics_short = df_with_topics.loc[(df_with_topics['date'] >= dateutil.parser.parse(T0)) & (df_with_topics['date'] <= dateutil.parser.parse(T1))]
    print_sample_articles_topic(df_with_topics_short, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], size_sample=(5, 50, 5))
6. Values in different societal groups ¶
ValueMonitor can be used to evaluate which values different societal groups tend to discuss.
def plot_values_in_different_groups(selected_dataset):
    values_in_different_groups(df_with_topics, dict_anchor_words, selected_dataset)

interact(plot_values_in_different_groups, selected_dataset=['NEWS', 'ETHICS', 'TECH'])
def plot_print_sample_articles_topic(selected_value, selected_dataset, size_sample):
    show_extracts = True # True, False
    show_full_text = False # True, False
    '''--------------------------------------------------------------------------'''
    df_with_topics_selected_technology_dataset = df_with_topics[df_with_topics['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset=["TECH", "NEWS", "ETHICS"], size_sample=(5, 50, 5))