We have included the full QA data in our repository at `data/test_data.jsonl`, where each line represents one example in the following format:
```json
{
    "uuid": "xxxx-xxxx-xxxx", // unique identifier for this data sample
    "question": "user question about ai research papers", // user question
    "answer_format": "text description of the answer format, e.g., a single float number, a list of strings", // can be inserted into the prompt
    "tags": [
        "tag1",
        "tag2"
    ], // tags for the data or task sample, see below for definitions
    "anchor_pdf": [
    ], // UUIDs of the papers that are explicitly mentioned in the question
    "reference_pdf": [
    ], // UUIDs of papers that may be used but are not provided in the question
    "conference": [
        "acl2023"
    ], // defines the search space of papers, usually conference+year
    "evaluator": {
        "eval_func": "function_name_to_call", // all eval functions are defined under the `evaluation/` folder
        "eval_kwargs": {
            "gold": "ground truth answer",
            "lowercase": true
        } // the gold or reference answer should be included in the `eval_kwargs` dict; other optional keyword arguments can be used for customization and function re-use, e.g., `lowercase=True` and `threshold=0.95`
    } // a dict specifying the evaluation function and its parameters
}
```

This section describes the different question categories (or tags) used for classification. In each example, `tags` contain:
- `single`: querying detailed information from a specific paper, e.g.,
  - Which downstream tasks does CLiCoTEA outperform other models in terms of zero-shot performance on the IGLUE benchmark? (`12a70e18-aa46-5779-bd69-2f3620d7f484`)
- `multiple`: posing questions across multiple papers, e.g.,
  - According to this survey, what're the three most recent decoder-only LLMs for NL2Code? How many programming languages do their training datasets each contain? (`a3c6958b-aed2-5e28-8dea-5d0b88550ac8`)
- `retrieval`: retrieving papers from a specific conference in a particular year, based on a description, e.g.,
  - Which paper unifies reinforcement learning and imitation learning methods under a dual framework? (`bd90400d-f8bf-5257-a64b-906a477992a8`)
- `comprehensive`: a combination of the three aforementioned question types, e.g.,
  - Among the text-to-SQL papers in ACL 2023, which one achieves the best test-suite accuracy on the SPIDER dataset? Tell me the paper title and corresponding test accuracy. (`bff9b330-bcd6-547f-8a07-2af88d99540d`)
- `text`: Q&A that focuses on text understanding and reasoning
- `table`: Q&A that requires identifying tables and their contents
- `image`: Q&A that involves the recognition of figures, charts, or graphs
- `formula`: Q&A that queries the details of math formulas
- `metadata`: metadata includes authors, institutes, e-mails, conferences, years, and other information that does not appear in the main text (e.g., page headers and footers)

💡 Note One example may involve several key capabilities, and we mark an example as `text` if and only if it doesn't involve any other key capability.

- `subjective`: answers that require LLM- or model-based evaluation
- `objective`: answers that can be evaluated with objective metrics
📚 Note More details on evaluation can be found in the Evaluation document.
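As a minimal sketch of how this format can be consumed, the snippet below loads `data/test_data.jsonl` and dispatches the evaluator named in `eval_func`. The `eval_string_match` function and the registry are illustrative stand-ins, not repository code: in the actual repository, the evaluation functions live under the `evaluation/` folder.

```python
import json

def eval_string_match(pred, gold, lowercase=False):
    """Hypothetical evaluator: exact string match, optionally case-insensitive."""
    if lowercase:
        pred, gold = pred.lower(), gold.lower()
    return float(pred == gold)

# Stand-in registry mapping `eval_func` names to callables.
EVAL_REGISTRY = {"eval_string_match": eval_string_match}

def evaluate_example(example, prediction):
    """Look up the evaluator and apply it with the stored keyword arguments."""
    spec = example["evaluator"]
    func = EVAL_REGISTRY[spec["eval_func"]]
    return func(prediction, **spec["eval_kwargs"])

def load_examples(path):
    """Read a JSONL file: one example dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the gold answer sits inside `eval_kwargs`, the same evaluator function can be reused across examples with different customization flags.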
In AirQA, the fields anchor_pdf, reference_pdf, and conference together define the set of papers associated with a given example. While anchor_pdf and conference may be provided as context in the prompt, reference_pdf must be withheld, as it constitutes part of the ground-truth answer.
To clarify their roles:
- `anchor_pdf` lists the papers explicitly mentioned or directly referenced in the question (i.e., the "anchor" documents the model can use to reason).
- `reference_pdf` contains papers that are not mentioned in the question but are required to formulate the correct answer. These papers should remain hidden during inference to avoid leaking answer information.
Example 1: Direct Reference (No Hidden Information)

```json
{
    "question": "Model A and model B, which is better?",
    "anchor_pdf": ["Model A", "Model B"],
    "reference_pdf": []
}
```

💡 Explanation The question directly names both models, so all necessary information is expected to be found in their respective papers (`anchor_pdf`). No external reference is needed, hence `reference_pdf` is empty.
Example 2: Indirect Reference (Hidden Answer Source)

```json
{
    "question": "On which dataset is Model A trained? What's the scale of this dataset?",
    "anchor_pdf": ["Model A"],
    "reference_pdf": ["Dataset B"]
}
```

⚠️ Explanation While the question only mentions Model A, the answer depends on details from Dataset B, which is not named in the question. Therefore, Dataset B is placed in `reference_pdf` and must not be included in the prompt; otherwise, the model would receive the answer indirectly.
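The hiding convention above can be sketched as a small prompt-assembly helper. This is an illustration of the rule, not repository API; the function name and the exact prompt layout are assumptions:

```python
def build_prompt_context(example):
    """Assemble the information that MAY appear in the prompt.

    `anchor_pdf` and `conference` are safe to expose; `reference_pdf`
    is deliberately excluded because it can leak the answer.
    """
    parts = [example["question"]]
    if example.get("answer_format"):
        parts.append("Answer format: " + example["answer_format"])
    if example.get("anchor_pdf"):
        parts.append("Provided papers: " + ", ".join(example["anchor_pdf"]))
    if example.get("conference"):
        parts.append("Search scope: " + ", ".join(example["conference"]))
    # NOTE: example["reference_pdf"] is intentionally never mentioned here.
    return "\n".join(parts)
```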
For the four question types in AirQA, the structure of `anchor_pdf`, `reference_pdf`, and `conference` is determined by their characteristics. Specifically:

- For the `single` type, the question explicitly refers to one paper. Thus, `anchor_pdf` contains exactly one entry (the mentioned paper), and no `reference_pdf` is needed.
- For the `multiple` type, the question explicitly mentions multiple papers. Accordingly, `anchor_pdf` includes all mentioned papers (at least one), while `reference_pdf` may contain additional papers required to answer the question but not named in the prompt (as shown in the previous two examples).
- For the `retrieval` type, the question describes a paper without naming it. Here, `anchor_pdf` is empty; instead, `conference` specifies the search scope, and `reference_pdf` contains the target paper, which is the answer itself.
- For the `comprehensive` type, the question requires identifying relevant papers from a conference based on descriptions (similar to `retrieval`) and then reasoning over them. Therefore, no papers are directly provided in `anchor_pdf`; instead, the relevant papers are placed in `reference_pdf`, and the model must first retrieve them using `conference` and contextual hints before answering.
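The per-type constraints above lend themselves to a simple consistency check. The following is a sketch under the stated rules; the function name is hypothetical and not part of the repository:

```python
def check_pdf_fields(example):
    """Check the structural constraints described above for one example.

    Returns a list of violation messages (empty if the example conforms).
    """
    tags = set(example["tags"])
    anchors = example["anchor_pdf"]
    refs = example["reference_pdf"]
    problems = []
    if "single" in tags and len(anchors) != 1:
        problems.append("single: anchor_pdf must contain exactly one paper")
    if "single" in tags and refs:
        problems.append("single: reference_pdf should be empty")
    if "multiple" in tags and len(anchors) < 1:
        problems.append("multiple: anchor_pdf must list all mentioned papers")
    if "retrieval" in tags or "comprehensive" in tags:
        if anchors:
            problems.append("retrieval/comprehensive: anchor_pdf must be empty")
        if not refs:
            problems.append("retrieval/comprehensive: reference_pdf holds the target paper(s)")
        if not example["conference"]:
            problems.append("retrieval/comprehensive: conference defines the search scope")
    return problems
```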
The template of the paper metadata (`data/metadata/*.json`) is:

```json
{
    "uuid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", // UUID
    "title": "This is the title of the paper.", // paper title
    "conference_full": "...", // full title of the conference
    "conference": "...", // conference abbreviation
    "year": 2026, // conference year, or the year in which the paper was published
    "volume": "...", // volume title
    "bibtex": "...", // BibTeX citation text
    "authors": ["Author 1", "Author 2", ...], // list of authors
    "num_pages": 10, // int value
    "pdf_url": "https://...", // URL to download the PDF, should end with .pdf
    "pdf_path": "data/papers/{conference}/{uuid}.pdf", // local path to save the PDF
    "abstract": "...", // paper abstract text
    "tldr": "...", // TL;DR summary
    "tags": ["keyword", "tag"] // keywords or tags
}
```

We use MinerU to extract key elements (e.g., text, tables, figures, and equations) from the raw PDFs. To facilitate future use, we provide the extracted content (`data/processed_data/*.json`) in the following JSON format:
```json
{
    "tables": [
        {
            "table_caption": "Table 1: ...",
            "table_html": "<html><body><table>...</table></body></html>",
            "table_bbox": [0.0, 0.0, 100.0, 100.0],
            "page_number": 2
        },
        ... more tables ...
    ],
    "figures": [
        {
            "figure_caption": "Figure 1 ...",
            "figure_bbox": [10.0, 10.0, 50.0, 50.0],
            "page_number": 1
        },
        ... more figures ...
    ],
    "equations": [
        {
            "equation_text": "$$ ... $$",
            "page_number": 3
        },
        ... more equations ...
    ],
    "references": [
        {
            "reference_text": "..."
        },
        ... more references ...
    ],
    "TOC": [
        {
            "title": "Section 1",
            "text": "section text",
            "level": 1,
            "page_number": 1,
            "page_numbers": [1]
        },
        ... more text blocks ...
    ]
}
```
⚠️ Note These extracted results are provided as a reference only and are not guaranteed to be perfect. If needed, you can always reprocess the original PDFs using your preferred toolchain.