During my day-to-day, I read papers and procrastinate from writing my thesis, so I often come up with high-level questions that I cannot research because I don’t have the experience, time, and computing resources. The following is such a research question which—if it has not been answered by someone else already—could be impactful. I’d be very happy about any pointers to relevant literature.

Can we expand active learning to asking questions?

We have lots of unlabeled data available, and labeling is expensive. This is where active learning comes in.

Given unlabeled data and a (somewhat trained) model, we use the model to score the unlabeled data, label the most promising samples, and add those to the training set.
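
A minimal sketch of one such round, assuming a scikit-learn-style classifier with `predict_proba`, NumPy arrays for the pool and training set, and a hypothetical `annotate` oracle standing in for the human labelers (entropy-based uncertainty sampling is just one of many possible acquisition functions):

```python
import numpy as np

def active_learning_round(model, X_labeled, y_labeled, X_pool, annotate, batch_size=10):
    """One round of pool-based active learning with uncertainty (entropy) sampling."""
    # Score the unlabeled pool: higher predictive entropy = more "promising" to label.
    probs = model.predict_proba(X_pool)                    # (n_pool, n_classes)
    scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # predictive entropy per sample
    query_idx = np.argsort(scores)[-batch_size:]           # most uncertain samples

    # The expensive step: ask the annotator only for these labels.
    new_labels = annotate(X_pool[query_idx])

    # Add the newly labeled samples to the training set and retrain.
    X_labeled = np.concatenate([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    model.fit(X_labeled, y_labeled)

    # Shrink the pool and return the updated state.
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled, X_pool
```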

This has proven quite effective, and even when it is not, it has spawned lots of interest in industry and research. It is an attractive and useful problem because big companies have lots of data, yet labeling is expensive and time-consuming; labeling is often a bottleneck. The fewer labels we need, the better.

Now, active learning usually gives a general signal: the model would really like to know the label for some \(x\). But data is often more complex than that, especially when it comes to NLP models like GPT-3 and others.

Can we expand active learning meaningfully to more complex data that does not have labels in the common sense?

NLP tasks include text summarization and question answering. The naive active learning equivalent here would be to identify texts that an NLP model does not know how to summarize, or questions it does not know how to answer with high certainty, and then have annotators provide summaries or answers for these samples to train on.
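
A hedged sketch of that naive scheme: use the model's own per-token log-probabilities on its generated summary or answer as an uncertainty proxy, and send the texts it is least confident about to annotators. `generate_with_scores` below is a hypothetical wrapper around whatever seq2seq model is used; it only has to return the generated text and the log-probability of each generated token.

```python
import numpy as np

def sequence_uncertainty(token_logprobs):
    """Uncertainty of a generated summary/answer: mean negative log-probability
    of its tokens (i.e. log-perplexity). Higher = less confident."""
    return -np.mean(token_logprobs)

def select_texts_for_annotation(texts, generate_with_scores, budget=100):
    """Rank unlabeled texts by how unsure the model is about its own output.

    `generate_with_scores(text)` is a hypothetical helper returning
    (generated_text, per-token log-probabilities) for one input text.
    """
    scored = []
    for text in texts:
        _, token_logprobs = generate_with_scores(text)
        scored.append((sequence_uncertainty(token_logprobs), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain first
    return [text for _, text in scored[:budget]]
```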

This is very limiting and much harder than it needs to be. Especially with large and complex contexts, it can be overkill, and human annotators will also be more likely to make mistakes.

Can we improve upon this?

Yes, if we look at query synthesis and in particular sub-query synthesis.

In active learning, query synthesis is about creating samples to ask for labels. This has not worked very well in the past because such samples are hard to generate well and human annotators often don’t understand them.

Sub-query synthesis could be different. In active learning for computer vision, there has been research into cropping images so that annotators only see the part of the image the model does not understand, which can make their task much easier.

The question is: can we apply this to NLP? Can we identify important parts of a text whose semantic meaning the model is uncertain about and generate targeted questions that would resolve this uncertainty?

For example, if we had a paragraph with a sentence that uses a pronoun, and it is not clear to the model who the pronoun refers to, we would want the model to highlight that section and ask “Who is she?”. An annotator could answer this more easily (and faster and at lower cost). Finally, the context, the question, and the answer could be added to the training set to retrain the model.

An example

Here is a more complex example; the text is copied from a news article:

But a report from the committee praised the government’s Net Zero Strategy.

A government spokeswoman welcomed the CCC’s generally positive response to the Net Zero Strategy and said it would meet all its climate change goals.

Boris Johnson has regularly promised that climate change can be tackled without what he calls “hairshirtery”.

Many experts agree technology is needed but say behaviour must change too.

They judge that the demand for high-carbon activities must be cut for the UK to meet climate targets in the 2030s.

The model might not understand some of the words and might have high uncertainty about how to resolve the pronoun “they” and who the people involved are. So we would want to generate questions and rank them according to informativeness (a sketch of one possible ranking follows this list):

  1. “Who is judging that demand must be cut?”
  2. “Who is Boris Johnson?” (assuming the original training data is a bit older)
  3. “What is ‘hairshirtery’?”
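
A hedged sketch of one way to produce such a ranking, assuming a hypothetical `generate_candidate_questions` helper and a QA model that exposes a (reasonably calibrated) distribution over candidate answers: score each question by the entropy of that answer distribution and keep the questions the model is most unsure about.

```python
import numpy as np

def answer_entropy(answer_probs):
    """Entropy of the model's answer distribution for one question; higher = more informative to ask."""
    p = np.asarray(answer_probs)
    return float(-(p * np.log(p + 1e-12)).sum())

def rank_questions(context, qa_model, generate_candidate_questions, top_k=3):
    """Rank candidate questions about a context by the QA model's uncertainty.

    `generate_candidate_questions` and `qa_model.answer_distribution` are
    hypothetical; any question generator and QA model with answer
    probabilities could stand in for them.
    """
    candidates = generate_candidate_questions(context)
    scored = [(answer_entropy(qa_model.answer_distribution(context, q)), q)
              for q in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain questions first
    return [q for _, q in scored[:top_k]]
```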

These questions could be sent to annotators and the answers might be added to the training set as:

Original Text | Question | Answer

Within a Q&A task this would be natural: essentially, we add additional training data this way.
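
As a rough sketch of what one such training record could look like (the field names and the filled-in answer are illustrative, not a fixed schema):

```python
# One hypothetical training record built from the example above;
# the answer is what the annotator would fill in.
training_example = {
    "context": "Many experts agree technology is needed but say behaviour must "
               "change too. They judge that the demand for high-carbon activities "
               "must be cut for the UK to meet climate targets in the 2030s.",
    "question": "Who is judging that demand must be cut?",
    "answer": "Many experts",
}
```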

Obviously, we could see this as a text completion problem, but it is not clear to me whether a question would then be generated according to the model’s highest uncertainty or would merely be the model’s own prediction of what a relevant question looks like. (I doubt that NLP models have a theory of “themselves” in any way, and they probably cannot easily predict their own uncertainty about something.)

More fine-grained research questions

  1. How can we capture epistemic uncertainty within a text context? Could we compute an embedding entropy per token by using a stochastic encoder but a single deterministic decoder? (A rough sketch of one possible approach follows this list.)

  2. Can we learn to generate questions to clarify the most uncertain parts of a text?

  3. Could we turn this into a dialog where, given a context, either evaluators ask questions and our model has to answer, or vice versa?
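
On question 1, a hedged sketch of one possible approach: keep dropout active in the encoder at inference time (Monte Carlo dropout), run several stochastic forward passes, and use the variance of each token’s embedding across passes as a stand-in for the per-token embedding entropy mentioned above. The encoder interface is assumed to be Hugging Face-style (returning `.last_hidden_state`), but nothing hinges on that.

```python
import torch

@torch.no_grad()
def per_token_uncertainty(encoder, input_ids, attention_mask, n_samples=8):
    """Per-token epistemic uncertainty via Monte Carlo dropout over the encoder.

    Assumes `encoder(input_ids, attention_mask=...)` returns an object with a
    `.last_hidden_state` of shape (batch, seq_len, hidden), as Hugging Face
    encoders do; any stochastic encoder could take its place.
    """
    encoder.train()  # keep dropout active so every forward pass is stochastic
    samples = [encoder(input_ids, attention_mask=attention_mask).last_hidden_state
               for _ in range(n_samples)]
    encoder.eval()

    stacked = torch.stack(samples)          # (n_samples, batch, seq_len, hidden)
    # Disagreement across passes: mean per-dimension variance of each token's embedding.
    return stacked.var(dim=0).mean(dim=-1)  # (batch, seq_len); high values = uncertain tokens
```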

Please cite this post if you find any of this useful, and/or it inspires your own research :hugs:

Thanks!

PS: It’s been a while since my last post (yet again). In the meantime, I have shared paper reviews on Notion, set up a research blog for more fully gestated ideas, and published a few papers. I hope to resurrect this blog for smaller posts that will be useful for me and others.