*** Send an email to [masked] to be added to the email list.
Every day the media generate large amounts of text. Often the political actors behind these texts are not clear, and even when they are, the policies of newly emerging political actors cannot always be readily associated with established positions on the political spectrum.
In this context, automated analyses of political content can help political scientists and the average media consumer to better understand and judge media content. At the intersection of the social sciences and data science, a new field is emerging that addresses this issue.
PyData Berlin is organising a follow-up event to Felix Biessmann’s talk ( https://www.youtube.com/watch?v=IhUSiXXg4rg ) at the PyData Berlin 2016 conference. The goal of this event is to carry forward the research that Felix started and to bring together experts from the social sciences, computer science, machine learning and user interface design. Together we want to come up with novel hypotheses, innovative analyses and easily accessible user experiences that give insights into the rapidly evolving political landscape. We offer a set of prerequisites that will help foster ideas and get developers up to speed quickly.
Friday Sept 30th:
18:00-21:00 Get together at DSR
Meet the other people, have a few drinks and talk about what to do.
Saturday Oct 1st:
10:00-12:00 Data Ambassador Presentations
The program starts Saturday morning with presentations from people who are familiar with the datasets. Sebastian Schelter will talk about the social network data, and Pola Lehmann from the Berlin Social Science Center will give an overview of the manifesto data. This will give all participants an opportunity to ask questions and get clarification on what to do with the data.
12:00-22:00 hack hack hack
Sunday Oct 2nd:
10:00-17:00 hack hack hack
17:00-19:00 presentations / closing notes / cleanup
We have secured a sponsorship from AWS that will allow you to use AWS computing resources at no additional cost. If you do not already have an AWS account, please create one before the event. During the event you'll get a voucher for AWS credit which you can then use to load credit onto your account.
Amazon has agreed to sponsor lunch on Saturday and Sunday. However, as this is a free event, we will not be able to provide food for the entire weekend. There are numerous restaurants around DSR, and there's of course always pizza delivery :). We've also agreed with a local brewer that they will provide some beer for the event.
We have three different datasets that are preprocessed and ready to use:
1) Plenary debates of the German and European parliaments (with party labels) http://www.statmt.org/europarl/
2) Party manifestos (with party labels and policy labels) https://manifestoproject.wzb.eu/
3) Social network data (Facebook posts)
Some software projects offer utilities that can be used to speed up data preprocessing.
Participants are invited to explore the datasets and come up with ways of analysing the data. To get you started, we’ve put together a list of ideas that we think are interesting both for machine learning people and for political scientists:
1) We have some results on predicting political bias in German texts, and other research has addressed this in other languages. A simple first direction would be to train one political bias prediction model per language, for as many languages as possible, and expose these models through a unified API, maybe even a Python package.
2) Single-language models are useful, but multilingual models could be more powerful. Using the same data as in 1), we’re interested in exploring the feasibility of training a multi-view model on multiple languages simultaneously.
3) A major problem with automatic political prediction models is their inability to deal with short texts, such as posts in social media. Word embeddings could be used to enrich the data obtained from short texts and improve model performance. Deep learning models have been shown to work extremely well on text classification tasks and are another possibility for experimentation.
4) Machine learning models are prone to poor generalization performance when the data used for testing is from a different text domain than the data used for training. For example, a classifier trained on parliament speeches is likely to perform poorly on social network posts. There are many ways to counteract this effect. Exploring some of them would help build more robust models.
5) How bad are ML models for political content prediction? We don’t know. Especially as we don’t know how bad humans are. One fun project could be to set up a web app that allows humans to judge texts and/or have them play against a machine. Several gamification settings are possible.
6) Social network data has more than just text. How can we exploit that data best?
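As a starting point for ideas 1) and 4), here is a minimal sketch of a bag-of-words text classifier that is trained on one text domain and scored on another. It is not the model from Felix's talk: it swaps in a small hand-written multinomial naive Bayes using only the Python standard library. The labels `party_a` / `party_b` and all example sentences are invented placeholders, not taken from the event datasets.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.label_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the given label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Invented placeholder data, NOT taken from the event datasets.
speeches = [
    ("we must lower taxes and reduce regulation for business", "party_a"),
    ("lower taxes will help business grow", "party_a"),
    ("we must raise wages and protect workers", "party_b"),
    ("protect workers and expand public services", "party_b"),
]
posts = [
    ("taxes too high for small business", "party_a"),
    ("workers deserve better wages", "party_b"),
]

model = NaiveBayes().fit([t for t, _ in speeches], [y for _, y in speeches])
in_domain = accuracy(model, [t for t, _ in speeches], [y for _, y in speeches])
out_of_domain = accuracy(model, [t for t, _ in posts], [y for _, y in posts])
print(f"in-domain accuracy: {in_domain:.2f}")
print(f"out-of-domain accuracy: {out_of_domain:.2f}")
```

The toy data is far too small to show a real domain gap; the point is only the evaluation setup, where the model is fit on speech-like text and scored separately on post-like text. With the actual datasets, the in-domain score should be computed on held-out speeches rather than on the training set.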
- Felix Biessmann
- Christos Christodoulou
- Katherine Jarmul
- Matti Lyra