Research documents clustering for CAMERA
The Horizon 2020 Coordination and Support Action CAMERA evaluates the impact of European mobility-related projects. The CORDIS database holds a large volume of unclassified project data, too large and too high-dimensional to classify manually. Moreover, not all of the projects in the CORDIS database are related to mobility.
These problems call for algorithms that detect patterns within the corpus of documents in the database. Applying automated methods to unclassified databases lets us broaden the scope of the project: we can look at all texts, including those not normally associated with mobility but that may be loosely related to mobility areas. In addition, a data-driven statistical model makes it possible to design and assess more metrics about the projects.
The rise of statistical Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of artificial intelligence that aims to make computers “understand”, interpret and manipulate human language. NLP combines several disciplines, including computer science, computational linguistics and statistical modeling, to bridge the gap between human communication and computer automation. The main challenges faced by NLP are speech recognition, natural language understanding and natural language generation.
Since the so-called “statistical revolution” in the late 1980s, NLP research has relied heavily on machine learning. Machine learning approaches use statistical inference to learn the rules of natural language automatically by analyzing large corpora of real-world examples. In machine learning, a corpus is a set of documents with human or computer annotations that can be used to generate a large set of “features” that “teach” algorithms to “understand” the relationships within the documents.
In CAMERA, we analyze more than 40,000 projects from 2007 to 2020. The document corpus is built by concatenating the “title” and “objective” of each project, yielding a combined average text length of 4200 words per document. The documents can’t easily be labeled by topic, so nothing is known in advance about their content or the number of topics covered. This prevents the use of supervised machine learning techniques to classify the documents by topic, adding a new layer of difficulty.
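As a minimal sketch of the concatenation step, assuming each project is available as a record with “title” and “objective” fields (the sample records and the helper name build_document are hypothetical, not taken from the CAMERA code base):

```python
# Hypothetical project records; a real CORDIS export has many more fields.
projects = [
    {"title": "Urban air mobility", "objective": "Assess drone corridors over cities."},
    {"title": "Rail freight hubs", "objective": None},
]


def build_document(project):
    """Concatenate title and objective into one text, skipping missing fields."""
    parts = [project.get("title"), project.get("objective")]
    return " ".join(part.strip() for part in parts if part)


# One text per project; this list is the corpus handed to the topic model.
corpus = [build_document(project) for project in projects]
```

Skipping empty fields rather than failing is a design choice made here for illustration; a real pipeline might prefer to log or drop projects without an objective.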
Furthermore, the texts are expected to present a complex topical distribution, with soft links between subtopics and documents belonging to multiple topics. Among all the research areas covered in the European Union, only mobility is relevant in CAMERA. This means that we need to identify the right topical distributions in a very sparse space. In this context, traditional unsupervised machine learning algorithms (e.g. hard clustering) do not perform well because they assign each text to a single fixed class, which rules out hybrid topic distributions. In this scenario, more complex methodologies are required, such as using probabilistic clustering to generate topic models.
Topic modelling in CAMERA
Topic modeling is a well-known tool for discovering hidden semantic structures in a corpus of documents. Topic models learn groups of related words from large corpora without any supervision. Based on the words used within a document, they mine topic-level relationships by assuming that a single document covers a small set of concise topics. The output of the algorithm is a set of term clusters, each of which identifies a “topic”. A topic model can be very useful for quickly examining a large set of texts and automating the discovery of topical relationships between them.
The most popular topic modeling algorithm is Latent Dirichlet Allocation (LDA). LDA is a three-level hierarchical Bayesian model that fits words and documents over an underlying set of topics. Its main particularity is that it is an unsupervised generative statistical model: it explains sets of observations through unobserved groups, which account for why some parts of the data are similar. In this case, the observations are words collected into documents, and each document can be represented as a mixture of a small number of topics.
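The generative story behind LDA can be sketched in a few lines. The example below is a toy illustration, not the CAMERA implementation: the vocabulary, the two topics and the Dirichlet parameter are invented, and the topic-word distributions are fixed by hand, whereas the full model also draws them from a Dirichlet prior.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, vocabulary, length, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


# Toy vocabulary and hand-written topic-word distributions (assumptions).
vocabulary = ["train", "airport", "traffic", "protein", "genome", "enzyme"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # toy "mobility" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # toy "biology" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, [0.5, 0.5], vocabulary, 8, rng)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and topic-word distributions most likely to have produced them.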
The aspect of LDA most relevant to the CAMERA objective is that it can be viewed as a matrix factorization technique. Any collection of documents can be represented in a vector space as a document-term matrix. The document-term matrix gives the frequency count of each word (represented as columns) in each document (represented as rows). LDA decomposes this document-term matrix into two lower-dimensional matrices, the document-topics matrix and the topic-terms matrix, with dimensions (N,K) and (K,M) respectively, where K is the number of topics (a parameter fixed by the analyst), N is the number of documents and M is the number of distinct terms.
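To make the shapes concrete, here is a small sketch that builds a document-term matrix for three invented sentences using whitespace tokenization (a simplification; a real pipeline would also normalize text and remove stop words). The variable names are illustrative only.

```python
from collections import Counter

# Toy corpus (invented for illustration).
docs = [
    "electric bus fleet charging",
    "bus traffic traffic model",
    "protein folding model",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per distinct term.
doc_term = []
for doc in tokenized:
    counts = Counter(doc)
    doc_term.append([counts[word] for word in vocabulary])

n_docs, n_terms = len(doc_term), len(vocabulary)
n_topics = 2  # K, chosen by the analyst

# Shapes of the two factors that LDA estimates from doc_term.
doc_topic_shape = (n_docs, n_topics)    # one topic mixture per document
topic_term_shape = (n_topics, n_terms)  # one term distribution per topic
```

The document-topics matrix has one row per document and K columns, and the topic-terms matrix has K rows and one column per distinct term, so their product has the same shape as the document-term matrix.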
By using the LDA topic modeling approach, we can analyze the corpus of documents and iteratively extract the documents with a higher probability of belonging to mobility-related topics. After extracting the most relevant documents, we can run topic modeling again to obtain the distribution of mobility-related subtopics. This will give us quantitative metrics such as the degree of coverage in specific research areas, correlations between topics and similarity metrics to find similar projects.
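One way to picture the extraction step, assuming a fitted model has already produced a document-topic matrix and an analyst has marked some topics as mobility-related: keep the documents whose probability mass on those topics reaches a threshold. The function name, the matrix values, the topic indices and the 0.6 threshold are all hypothetical.

```python
def select_documents(doc_topic, target_topics, threshold):
    """Return indices of documents whose probability mass on target_topics reaches threshold."""
    selected = []
    for index, distribution in enumerate(doc_topic):
        mass = sum(distribution[topic] for topic in target_topics)
        if mass >= threshold:
            selected.append(index)
    return selected


# Hypothetical document-topic matrix: each row sums to 1.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.35, 0.35],
]
mobility_topics = [0, 1]  # topic indices judged mobility-related (assumption)

kept = select_documents(doc_topic, mobility_topics, threshold=0.6)
```

The retained documents would then form the smaller corpus on which topic modeling is run again to find subtopics.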
But the methodology presents another problem: because LDA is an unsupervised algorithm, we cannot “choose” which topics are interesting and which are not. This is a serious issue when looking for a certain topic or distribution of topics. How did we solve this problem? Stay tuned for my next post on how we turned the unsupervised LDA methodology into a semi-supervised one.