Dear Gregor Wiedemann, what applications of Natural Language Processing, i.e., automated text and language processing based on artificial intelligence, do you currently see in science and especially in the social sciences?
One focus of Natural Language Processing applications in the social sciences is the automated content analysis of very large amounts of text. This makes, for example, debates in social media or across many news media accessible for research that would otherwise be impossible to handle manually.
What are you using NLP for at the moment, personally?
I'm currently working on NLP methods for evaluating argument structures in Twitter debates. Among other things, we are observing how public discourse on the use of nuclear energy has changed in recent years. For example, environmental aspects now appear in pro-nuclear arguments much more often than before, which is a significant change from previous decades. In a second project, we extract protest events such as demonstrations, rallies or strikes from local media and prepare these data for protest event research.
Let's assume I have collected tens of thousands of text documents in my project that I want to analyze. How can Natural Language Processing help me with this?
Let's stay with the example of protest event research. Here, for example, a local newspaper reports on a Pegida rally in Dresden. To find this coverage automatically, we use a classification method that detects whether articles containing the phrase "Pegida protests" report on a protest event or whether the phrase is only used in a more general context. This allows us to reliably identify, from the large set of all newspaper reports containing such keywords, only those articles that actually report on protest events. In a second step, we automatically extract information such as the motto, the number of participants and the organizers from these articles. The end result is a structured dataset of all local protest events covered by four local newspapers over the last 20 years, which can then be analyzed by political scientists.
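To make this two-step procedure more concrete, here is a minimal Python sketch: a keyword filter narrows the corpus, and a simple supervised classifier decides whether a matched article really reports a protest event. The texts, labels and the scikit-learn setup are hypothetical placeholders, not the actual pipeline used in the project.

```python
# Step 1: keyword search narrows the corpus; step 2: a binary classifier decides
# whether a keyword-matched article actually reports on a protest event.
# (Hypothetical example data; a real model would be trained on hundreds of coded articles.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded training examples (1 = reports a protest event, 0 = keyword used in another context)
train_texts = [
    "Several thousand people joined the Pegida rally in Dresden on Monday evening.",
    "The city council debated the long-term political effects of the Pegida protests.",
]
train_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Step 1: keyword filter over the full newspaper corpus
corpus = ["... articles loaded from the newspaper archive ..."]
candidates = [doc for doc in corpus if "Pegida" in doc]

# Step 2: keep only candidates the classifier labels as actual protest event reports
protest_articles = [doc for doc in candidates if clf.predict([doc])[0] == 1]
```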
What requirements must my text data meet so that I can analyze it using Natural Language Processing?
Ideally, the training data for a model should come from the same population as the target data. The texts must also be available in digital form, of course, and they should not deviate too much from standard language. This means that, for example, transcripts of spoken language containing heavy slang, or historical documents written in very old language, can sometimes cause problems. But there are solutions for this: in such cases, the language models used must be adapted to the target domain. By the way, it is now quite straightforward to use multilingual corpora (e.g. German and English) in the same analysis, or to make good predictions for German target data with English training data.
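As an illustration of such cross-lingual transfer, the following sketch embeds English training texts and German target texts into the same vector space with a multilingual sentence encoder and trains a simple classifier on the English side. The model name is just one publicly available example, and the texts and labels are hypothetical.

```python
# Cross-lingual transfer sketch: train on English examples, predict on German texts.
# The multilingual encoder maps both languages into a shared embedding space.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example model

# English training data (1 = protest-related, 0 = not; hypothetical labels)
en_texts = ["Thousands marched against nuclear power in the capital.",
            "The stock market closed slightly higher on Friday."]
en_labels = [1, 0]

clf = LogisticRegression().fit(encoder.encode(en_texts), en_labels)

# German target data is classified with the model trained on English examples
de_texts = ["Tausende demonstrierten in Berlin gegen Atomkraft."]
print(clf.predict(encoder.encode(de_texts)))
```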
And what prerequisites do I as a scientist need in order to work with NLP? How much programming knowledge or other relevant prior knowledge do I need?
For some years now, NLP methods have mainly been based on multi-layer neural networks (also known as "deep learning"), because these perform significantly better than earlier approaches based on, for example, word lists. In order to apply these neural networks to large amounts of text, it is necessary to work with certain program libraries, most of which are written for the scripting language Python. There are now also first so-called R wrappers, which make the functionality of the Python libraries available in the R programming language. However, there is currently no way around programming your own analysis scripts.
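For a sense of what working with such a library looks like in Python, here is a minimal example using the transformers package, which wraps pre-trained deep neural networks behind a simple pipeline interface; the sentiment task and example sentence are only for illustration.

```python
# Minimal use of a deep-learning NLP library from Python: the pipeline loads a
# pre-trained neural network (the default English sentiment model) on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The rally was peaceful and well attended."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```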
Before I can analyze my text data, I have to tell the algorithm what to look for and what to do with what it finds. How do you do that?
In order to teach a machine which texts to extract as relevant from a large set, which categories to sort a text into, or what information to extract from a document, I need to show it what that information looks like using examples. To do this, a training data set is created, typically containing a few hundred or a few thousand positive and negative examples of the target category. With newer techniques known as few-shot learning, significantly fewer examples are sufficient to adapt a pre-trained neural network to a target category.
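Such a training data set can be as simple as a table of texts with a binary code for the target category. The following sketch shows one possible format; the texts, column name and file name are hypothetical.

```python
# Sketch of a hand-coded training data set: positive and negative examples of a
# target category, stored as a simple table (hypothetical texts and column names).
import pandas as pd

training_data = pd.DataFrame([
    {"text": "Around 500 people gathered for a rally against the planned power plant.", "protest_event": 1},
    {"text": "The mayor commented on last year's budget negotiations.",                 "protest_event": 0},
    # ... typically a few hundred or a few thousand more coded examples ...
])
training_data.to_csv("protest_training_data.csv", index=False)
```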
What does my training data set have to look like so that the quality of the analysis is high and reliable afterwards?
The training data for training an NLP model should be as complete and as consistently coded as possible. Complete means that there is a deliberate category decision for every entity (word, sentence, or document) presented to the model. Consistent (or reliable) means that if several coders create the data set, they must reach the same judgment about category assignment in the same cases. If the machine is presented with inconsistent training data, it will not be able to learn a category adequately.
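One common way to check this consistency before training is to have two coders label the same items and compute an inter-coder agreement measure such as Cohen's kappa, as in this small sketch with hypothetical codings.

```python
# Checking coder consistency: two coders label the same items, and Cohen's kappa
# indicates how reliably the category is coded (values near 1 = high agreement).
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 0, 1, 0]   # category decisions of coder A (hypothetical)
coder_b = [1, 0, 1, 0, 0, 0, 1, 0]   # decisions of coder B on the same items

print(cohen_kappa_score(coder_a, coder_b))
```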
How can I tell whether the NLP model is working reliably?
To tell whether an NLP model works reliably, it is tested on test data. Test data, just like training data, are hand-coded texts. The predictions of the trained model on the test data are then compared with the hand-coded categories. In this way, the quality of the automatic predictions can be quantified.
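In practice, this comparison is often summarized with standard metrics such as precision, recall and F1 score, for example as in the following sketch with hypothetical labels and predictions.

```python
# Comparing model predictions on test data with the hand-coded categories
# (hypothetical values); precision, recall and F1 summarize prediction quality.
from sklearn.metrics import classification_report

human_labels      = [1, 0, 1, 1, 0, 0, 1, 0]   # hand-coded test data
model_predictions = [1, 0, 1, 0, 0, 1, 1, 0]   # predictions of the trained model

print(classification_report(human_labels, model_predictions,
                            target_names=["no protest event", "protest event"]))
```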
The technology in this area is rapidly progressing. What developments have advanced NLP in particular recently?
We are currently experiencing rapid progress in the field of artificial intelligence and machine text understanding with so-called large language models. The most recent example is ChatGPT from the company OpenAI. This language model is able to understand users' requests and respond to them, thus generating human-like dialogic communication.
What are the limitations of NLP from your point of view?
Although language models already perform impressively by presenting the knowledge stored in them to users in dialogic form, their internal structure lacks a symbolic-logical knowledge representation that is transparently comprehensible from the outside. The goal of current research is to make this "black box" inside neural networks decodable and thus also to ensure that more formally correct and better-validated knowledge can be stored in these models.
---
About Dr. Gregor Wiedemann
Dr. Gregor Wiedemann is Senior Researcher for Computational Social Science at the Leibniz Institute for Media Research │ Hans Bredow Institute (HBI), where he directs the Media Research Methods Lab (MRML) together with Sascha Hölig. His current work focuses on the development of methods and applications of Natural Language Processing and Text Mining for empirical social and media research.
Wiedemann studied political science and computer science in Leipzig and Miami. In 2016, he completed his doctorate in computer science at the University of Leipzig in the Automatic Language Processing department on the possibilities of automating discourse and content analysis using text mining and machine learning methods. He then worked as a postdoc in the Department of Language Technology at the University of Hamburg.
There, he worked on methods for unsupervised information extraction to support investigative research in unknown document repositories (see newsleak.io) and on the detection of hate speech and counter-speech in social media.
Until taking over the leadership of the MRML, he worked in the DFG project "A framework for argument mining and evaluation (FAME)", which deals with the automatic detection and evaluation of recurring argument structures in empirical texts.
(Source of short biography: Hans Bredow Institute)