Questions, and their responses, clarify our day-to-day queries and give us the information we need. A plethora of platforms like Stack Overflow, Yahoo! Answers, and Quora cater to our questioning needs. Yet on all such platforms, many questions go unattended and unanswered. Bothered by the number of unanswered questions, we investigated the factors that can prevent a question from getting answered, or conversely, the factors that make a given question a great question.
To understand the role of such factors, we employed a MISR dataset. Multiple Inquirer, Single Responder (MISR) datasets are Q&A sessions where many inquirers pose questions to a single responder. A classic MISR example is a classroom, where multiple students seek answers from their teacher. In contrast, SIMR (Single Inquirer, Multiple Responder) datasets involve a single inquirer with multiple potential responders. SIMR datasets are abundant: Quora and Stack Overflow fall in this category.
In our quest to understand the factors behind unanswered questions, we took advantage of Reddit AMA (Ask Me Anything) sessions, where notable personalities take questions from Reddit users. Personalities from various domains, ranging from authors to actors, have conducted AMAs on Reddit in the past. MISR datasets like the Reddit AMA dataset let us better control the variables associated with a typical Q&A forum: since all the questions are posed to a single, known responder, it is easy to analyse their response trends.
The underlying factors can be roughly divided into two types: i) semantic factors and ii) non-semantic factors. Semantic factors capture the meaning of the question, whereas non-semantic factors capture other information, such as the length of the question, its politeness quotient, and so on. Non-semantic factors are simple to capture, and each factor's individual role can be measured by how well it alone predicts whether an unseen question will be answered. Of all the analysed factors, we observe that only the temporal and redundancy factors have a noteworthy impact on the response rate (see fig 1). Questions that were original (non-redundant) and were asked earlier in the AMA had far higher chances of eliciting responses. When asked early, a question competes with fewer questions for the celebrity's attention and hence stands a better chance of getting answered.
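As a rough illustration of this per-factor analysis, here is a minimal sketch (not the paper's exact pipeline) that fits a one-feature classifier per non-semantic factor and reports its cross-validated AUC. The data is synthetic and the feature names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in data: minutes into the AMA at which each question was
# posted, and a redundancy score (similarity to earlier questions).
time_offset = rng.uniform(0, 180, n)
redundancy = rng.uniform(0, 1, n)
# Simulate the observed trend: early, original questions get answered more.
p_answer = 1 / (1 + np.exp(0.03 * time_offset + 2.0 * redundancy - 2.0))
answered = rng.binomial(1, p_answer)

# Fit a classifier on each factor alone; a higher cross-validated AUC means
# that single factor is more predictive of whether a question gets answered.
for name, feat in [("temporal", time_offset), ("redundancy", redundancy)]:
    auc = cross_val_score(LogisticRegression(), feat.reshape(-1, 1),
                          answered, cv=5, scoring="roc_auc").mean()
    print(f"{name:>10} factor alone: AUC = {auc:.3f}")
```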
Non-semantic factors are intuitive and insightful; however, they fail to account for the content of the question. For instance, non-semantic features can never yield findings like "personal questions tend to be ignored". It is hard to enumerate all the semantic factors that might impact the response rate of a question, and even if we could magically generate them all, it is nearly impossible to classify a new question against them, as there are no annotations to learn from. Owing to these issues, semantic features are relatively hard to capture.

Automatically discovering the underlying factors and topics in documents is a well-studied problem, with established topic-modelling techniques like LDA. However, the LDA-generated topics were highly incoherent, possibly because questions are considerably shorter than typical documents. To our rescue, NMF (non-negative matrix factorisation) techniques attain the needed coherent groupings. We tuned NMF (see fig 2) to generate non-negative and sparse embeddings (NNSE) of questions. The rationale behind a sparse, non-negative representation is simple and somewhat intuitive: if I were to describe Delhi to someone, I would only mention a few facts (hence sparse), and I would mention things that are true of Delhi rather than things that are absent from it (hence non-negative). NNSE embeddings are known to be highly interpretable, and we second that. It is only because of the interpretable dimensions of these representations (see fig 3) that we discover some interesting hidden trends. For instance, actors generously answered questions about activities that occur behind the scenes, whereas they evaded questions about who their favourite actors, actresses, and movies were. Similarly, politicians eagerly responded to and clarified all questions about their campaign finances, yet ignored questions pertaining to rising unemployment.
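To make the NNSE idea concrete, here is a minimal sketch using scikit-learn: NMF over TF-IDF vectors, with an L1 penalty to encourage sparse embeddings. This is only an approximation of the approach described above, not the paper's exact factorisation or tuning, and the sample questions are invented:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "What was it like working behind the scenes on your latest movie?",
    "Who are your favourite actors and movies of all time?",
    "How is your campaign being financed this election cycle?",
    "What is your plan to tackle rising unemployment?",
    # ... the full set of AMA questions would go here
]

# Questions x vocabulary matrix of TF-IDF weights (already non-negative).
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(questions)

# l1_ratio pushes the learned embeddings toward sparsity; non-negativity
# comes for free from the factorisation itself.
nmf = NMF(n_components=2, alpha_W=0.1, l1_ratio=0.5, random_state=0)
W = nmf.fit_transform(X)  # sparse, non-negative embedding of each question
H = nmf.components_       # each row is one latent dimension over words

# Interpret each dimension via its top-weighted vocabulary terms.
vocab = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top_terms = vocab[row.argsort()[::-1][:3]]
    print(f"dimension {k}: {', '.join(top_terms)}")
```

Inspecting each dimension's top terms is what makes the representation interpretable: a question's few non-zero embedding coordinates point directly at human-readable themes like "behind the scenes" or "campaign finance".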
Research around questioning techniques has been prevalent since the time of Socrates (c. 400 BC). However, most prior studies sample only a few hundred classroom students, and the quality of questions has never been analysed with the response rate in mind. We address this gap and discover many interesting trends through a large-scale empirical evaluation on a novel Reddit AMA dataset. This study is a first step towards our goal of reformulating questions in ways that increase their chances of getting answered.
PS: The paper about this work can be found here, and the Reddit AMA dataset is here.