It’s been more than 10 years since I presented one of my first papers at the first conference on Automated Knowledge Base Construction (AKBC) in Grenoble in 2010 😬
I still remember how intimidated I felt by all those research stars with their proven track records, amazing papers, citations - and me, with my little paper that I needed to correct at the last minute because I had made a severe error in the evaluation code (my passion for automated tests was born that day 🐥). Still, I had a really good time, met many nice people and in the end managed to present my work.
10-plus years later, having switched from research and the Wikipedia domain to industry and healthcare data, the challenges have rather increased, especially when tackling German data. Given the (almost complete) absence of publicly available structured data, the need for automated knowledge base construction is even more pressing. That’s why I went to London to attend AKBC 2022! Great schedule, great location, Great Britain, here I come! 🙌
Barbican Centre feels like a mixture of an old James Bond movie and a huge botanic garden - amazing!
The remainder of this post is a (heavily biased) collection of the talks, links, repos, papers, (handy cam) pictures, ideas, thoughts… that I found most interesting. If you are looking for a detailed summary of the conference, you will not find it here, sorry 🤷‍♀️
Talks, Keynotes & Papers
Temporal and Multilingual Knowledge Graphs
The world is dynamic and Knowledge Graphs (KG) should reflect this time-varying nature. Temporal KGs provide a mechanism to represent this dynamism through time-scoped edges. While question answering over traditional KGs has received quite a bit of attention, QA over Temporal KGs is still in its early stages. […]
Knowledge Graphs and Language Models
CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction
🔈 Tom Hope
Knowledge graph (KG) link prediction is a fundamental task in artificial intelligence, with applications in natural language processing, information retrieval, and biomedicine. Recently, promising results have been achieved by leveraging cross-modal information in KGs, using ensembles that combine knowledge graph embeddings (KGEs) and contextual language models (LMs). […]
- Cross modal: combine knowledge graph embeddings and text (language models).
Example relation: "Aspirin" treats what?
- more on CascadER
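To make the cross-modal cascading idea concrete, here is a toy sketch (all entities and scores are made up, and the scoring functions are stand-ins for real models): a cheap KGE model ranks all candidate tail entities, and only the top-k survivors are re-scored with the expensive language model.

```python
# Toy sketch of a two-stage cascade for KG link prediction:
# stage 1 scores every candidate cheaply (KGE), stage 2 blends in
# expensive LM scores for the top-k candidates only.

def cascaded_link_prediction(candidates, kge_score, lm_score, k=2, alpha=0.5):
    """Rank candidates with a cheap-then-expensive scoring cascade."""
    # Stage 1: cheap KGE scores for every candidate.
    stage1 = sorted(candidates, key=kge_score, reverse=True)
    top, rest = stage1[:k], stage1[k:]
    # Stage 2: blend KGE and LM scores, but only for the top-k.
    reranked = sorted(
        top,
        key=lambda c: alpha * kge_score(c) + (1 - alpha) * lm_score(c),
        reverse=True,
    )
    return reranked + rest

# Hypothetical scores for the query ("Aspirin", treats, ?):
kge = {"headache": 0.8, "fever": 0.7, "insomnia": 0.3}
lm = {"headache": 0.6, "fever": 0.9, "insomnia": 0.1}

ranking = cascaded_link_prediction(list(kge), kge.get, lm.get, k=2)
print(ranking)  # → ['fever', 'headache', 'insomnia']
```

The cascade keeps the LM out of the loop for all but a handful of candidates, which is what makes the cross-modal ensemble affordable at KG scale.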
Data driven policy for pandemic response: data, knowledge, and action well outside of the ivory tower
🔈 Jesse Tenenbaum, Chief Data Officer at DHHS
The “central dogma” of informatics, it has been said, is turning data into information and information into knowledge. Some take this framework further: knowledge informs action. Moreover, in the Learning Health System model, action begets more real-world data, which can be converted into information, etc. […]
Really great talk with lots of puns and fun anecdotes!
DIKA: Data, Information, Knowledge, Action
Data Strategy Framework
Probabilistic fuzzy match between data silos: use soundex to handle typos
- Worth reading: NCDHHS’s Data Sharing Guidebook
- Soundex
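Since Soundex did the heavy lifting in that fuzzy-matching anecdote, here is a minimal from-scratch sketch (Python's standard library has no Soundex); two names with the same four-character code become candidates for a match across silos even when one of them is a typo.

```python
# Minimal Soundex sketch: letter + 3 digits, consonants mapped to
# digit classes, vowels dropped, adjacent duplicate codes collapsed.

CODES = {c: d for letters, d in
         [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
          ("L", "4"), ("MN", "5"), ("R", "6")]
         for c in letters}

def soundex(name: str) -> str:
    name = name.upper()
    result, prev = name[0], CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":            # H and W are skipped entirely
            continue
        code = CODES.get(ch, "")  # vowels yield "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]   # pad/truncate to 4 characters

print(soundex("Smith"), soundex("Smyth"))    # → S530 S530
print(soundex("Robert"), soundex("Rupert"))  # → R163 R163
```

A probabilistic matcher would then treat "same Soundex code" as one (noisy) signal among several, rather than as a hard join key.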
Project Alexandria in Viva Topics: AKBC in practice
At AKBC 2019, we presented Project Alexandria as a solution to inferring a knowledge base completely automatically from unstructured data. Since then, we have built Alexandria into the heart of a new Microsoft product called Viva Topics, launched last year. Viva Topics automatically constructs a knowledge base from an organization’s documents and intranet pages, and surfaces it across a wide range of Microsoft applications including SharePoint, Teams, Outlook and more. […]
Generative model of Entity Linking
- Knowledge Base construction without looking at the data
- Entity fragments: pieces of the same entity that are not yet merged
- Supports multiple languages (tokenization and similar components may need to be adapted)
- More about Project Alexandria
Trustworthy Natural Language Generation with Communicative Goals
While recent work in large language models has made natural language generation fluent, these models suffer from content “hallucination”, where model-generated statements are not attributable to sources in communicative scenarios (e.g. summarization, question answering and responses in dialogue systems) […]
Query-focused Multi-Document Summarization
Informative Dialogue Agents - a ChatGPT preview
NLP for Science: Advances and Challenges
With over one million papers added every year to the PubMed biomedical index alone — the explosion of scholarly knowledge presents tremendous opportunities for accelerating research across the sciences. However, the complexity of scientific literature presents formidable challenges for existing AI and NLP technologies, limiting our ability to tap into this vast treasure trove of information. In this talk, I will present our recent work toward helping researchers and clinicians make use of knowledge embedded in the literature. […]
Literature-Augmented Clinical Outcome Prediction
- BEEP (Biomedical Evidence-Enhanced Predictions): automatically retrieves patient-specific literature based on intensive care EHR notes and uses it to enhance clinical outcome prediction
- adding literature significantly reduces error (over 25% increase in precision@top-k scores)
- potential next step: incorporate evidence identification and inference directly into retrieval and predictive models
Retrieval-augmented Language Models for clinical outcome prediction
Integration of graph and text modalities
🤔 Does this additional context also help for less complicated cases? I.e. patients that are not in the ICU?
Keeping LMs in sync with the real world
Our world is open-ended, non-stationary and constantly evolving; thus what we talk about and how we talk about it changes over time. This inherent dynamic nature of language comes in stark contrast to the established static paradigm of NLP. This staticness has led over the years to a number of peculiarities; our models are “stuck” to the time they were trained, our systems are not designed to be easily adaptive, and our benchmarks further perpetuate this vicious circle.
Language Models get outdated over time
Searching the Internet and conditioning on top 5 results is worth more than 273 billion parameters!
The approach uses Google queries and is therefore not strictly reproducible, as Google results are prone to change: “If we stick to being reproducible, we are not going to make progress.”
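The "condition on the top 5 results" trick boils down to prompt construction: retrieved snippets are prepended to the question before it goes to the LM. A minimal sketch (the snippets and the downstream model call are hypothetical, not a real API):

```python
# Sketch of retrieval-augmented prompting: build a prompt from the
# top-k retrieved snippets plus the question. The resulting string
# would be passed to a language model, which is not shown here.

def build_prompt(question: str, snippets: list[str], k: int = 5) -> str:
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets[:k], 1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

snippets = [
    "AKBC 2022 took place at the Barbican Centre in London.",
    "The Barbican Conservatory is London's second largest botanic garden.",
]
print(build_prompt("Where did AKBC 2022 take place?", snippets))
```

Because the snippets are fetched at query time, the LM can answer with up-to-date facts its frozen parameters never saw - exactly the "keeping LMs in sync" point of the talk.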
Anti-vaccination arguments: a conceptual taxonomy and a machine-learning model
[…] Through a preregistered systematic literature review and thematic analysis of anti-vaccination arguments, we developed a hierarchical taxonomy that relates common arguments and themes to 11 attitude roots that explain why an individual might express opposition to vaccination. We further validated our taxonomy on COVID-19 anti-vaccination misinformation, through a combination of human coding and machine learning using natural language processing algorithms. […]
Anti-vaccination arguments: a conceptual taxonomy and a machine-learning model
QuALITY: Question Answering with Long Input Texts, Yes!
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. […]
Entity-Centric Query Refinement
We introduce the task of entity-centric query refinement. Given an input query whose answer is a (potentially large) collection of entities, the task output is a small set of query refinements meant to assist the user in efficient domain exploration and entity discovery.
Few-Shot Inductive Learning on Temporal Knowledge Graphs using Concept-Aware Information
Knowledge graph completion (KGC) aims to predict the missing links among knowledge graph (KG) entities. Though various methods have been developed for KGC, most of them can only deal with the KG entities seen in the training set and cannot perform well in predicting links concerning novel entities in the test set.
Poster Session
Poster session in the Barbican Conservatory, London's second largest botanic garden.
Pseudo-Riemannian Embedding Models for Multi-Relational Graph Representations
This can handle two highly similar entities that should nevertheless not be linked (something you cannot do with Gaussian spaces). Example: genes that are triggered by the same transcription factor but result in totally different proteins.
Pseudo-Riemannian Embedding Models for Multi-Relational Graph Representations
UnCommonSense: Informative Negative Knowledge about Everyday Concepts
Enriching commonsense knowledge bases with informative negated statements boosts their usability.
UnCommonSense: Informative Negative Knowledge about Everyday Concepts
Other interesting papers
- Generative Multi-hop Retrieval, EMNLP 2022
- Discovering Fine-Grained Semantics in Knowledge Graph Relations, CIKM 2022
- Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning, NeurIPS 2022
- CODEC: Complex Document and Entity Collection, SIGIR 2022
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study, EMNLP 2022
- Zero-shot Word Sense Disambiguation using Sense Definition Embeddings, ACL 2019
Impressions
London: Barbican Centre, London Bridge, Tower Bridge, Tower of London
… I spent a substantial amount of time waiting for tubes due to flooding
… I am still dumbfounded by the hotel prices in London 😑