
Data Format

AirQA Example Format

We have included the full QA data in our repository at data/test_data.jsonl, where each line represents an example in the following format:

{
    "uuid": "xxxx-xxxx-xxxx", // unique identifier for this data sample
    "question": "user question about ai research papers", // user question
    "answer_format": "text description on answer format, e.g., a single float number, a list of strings", // can be inserted into prompt
    "tags": [
        "tag1",
        "tag2"
    ], // different tags for the data or task sample, see below for definition
    "anchor_pdf": [
    ], // UUIDs of the papers that are explicitly mentioned in the question
    "reference_pdf": [
    ], // UUIDs of papers that may be used but not provided in the question
    "conference": [
        "acl2023"
    ], // define the search space of papers, usually conference+year
    "evaluator": {
        "eval_func": "function_name_to_call", // all eval functions are defined under `evaluation/` folder
        "eval_kwargs": {
            "gold": "ground truth answer",
            "lowercase": true
        } // the gold or reference answer should be included in `eval_kwargs` dict. Other optional keyword arguments can be used for customization and function re-use, e.g., `lowercase=True` and `threshold=0.95`.
    } // A complex dict specifying the evaluation function and its parameters.
}
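Since each line of data/test_data.jsonl is an independent JSON object, loading the file reduces to parsing line by line. A minimal sketch (the helper name is ours, not part of the repository):

```python
import json

def load_examples(path="data/test_data.jsonl"):
    """Load AirQA examples: one JSON object per non-empty line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples
```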

Tags of AirQA

This section describes the question categories (or tags) used for classification. The tags field of each example contains entries from the following three categories:

Category 1: Task Goals

  • single: querying detailed information from a specific paper, e.g.,
    • Which downstream tasks does CLiCoTEA outperform other models in terms of zero-shot performance on the IGLUE benchmark? (12a70e18-aa46-5779-bd69-2f3620d7f484)
  • multiple: posing questions across multiple papers, e.g.,
    • According to this survey, what're the three most recent decoder-only LLMs for NL2Code? How many programming languages do their training datasets each contain? (a3c6958b-aed2-5e28-8dea-5d0b88550ac8)
  • retrieval: retrieving papers from a specific conference in a particular year, based on the description, e.g.,
    • Which paper unifies reinforcement learning and imitation learning methods under a dual framework? (bd90400d-f8bf-5257-a64b-906a477992a8)
  • comprehensive: a combination of the three aforementioned question types, e.g.,
    • Among the text-to-SQL papers in ACL 2023, which one achieves the best testsuite accuracy on the SPIDER dataset? Tell me the paper title and corresponding test accuracy. (bff9b330-bcd6-547f-8a07-2af88d99540d)

Category 2: Key Capabilities

  • text: Q&A that focuses on text understanding and reasoning
  • table: Q&A that requires identifying tables and their contents
  • image: Q&A that involves the recognition of figures, charts, or graphs
  • formula: Q&A that queries the details of math formulas
  • metadata: Q&A about paper metadata, i.e., authors, institutions, emails, conferences, years, and other information that does not appear in the main text (e.g., page headers and footers)

💡 Note One example may involve several key capabilities; an example is marked text if and only if it involves no other key capability.

Category 3: Evaluation Types

  • subjective: answers that require LLM or model-based evaluation
  • objective: answers that can be evaluated with objective metrics

📚 Note More details on evaluation can be found in the Evaluation document.
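For objective evaluation, the evaluator dict simply names a function and its keyword arguments. A toy dispatch sketch is below; eval_string_match is a hypothetical stand-in for illustration only, while the real functions live under the repository's `evaluation/` folder:

```python
def eval_string_match(pred, gold, lowercase=False):
    """Hypothetical objective metric: exact string match,
    optionally case-insensitive (illustration only)."""
    if lowercase:
        pred, gold = pred.lower(), gold.lower()
    return float(pred == gold)

# Registry mapping `eval_func` names to callables.
EVAL_FUNCS = {"eval_string_match": eval_string_match}

def evaluate(pred, evaluator):
    """Look up `eval_func` and pass `eval_kwargs` (which holds the
    gold answer plus optional customization flags) straight through."""
    func = EVAL_FUNCS[evaluator["eval_func"]]
    return func(pred, **evaluator["eval_kwargs"])
```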

Paper Retrieval Scale

In AirQA, the fields anchor_pdf, reference_pdf, and conference together define the set of papers associated with a given example. While anchor_pdf and conference may be provided as context in the prompt, reference_pdf must be withheld, as it constitutes part of the ground-truth answer.

To clarify their roles:

  • anchor_pdf lists the papers explicitly mentioned or directly referenced in the question (i.e., the "anchor" documents the model can use to reason).
  • reference_pdf contains papers that are not mentioned in the question but are required to formulate the correct answer. These papers should remain hidden during inference to avoid leaking answer information.

Example 1: Direct Reference (No Hidden Information)

{
    "question": "Model A and model B, which is better?",
    "anchor_pdf": ["Model A", "Model B"],
    "reference_pdf": []
}

💡 Explanation The question directly names both models, so all necessary information is expected to be found in their respective papers (anchor_pdf). No external reference is needed, hence reference_pdf is empty.

Example 2: Indirect Reference (Hidden Answer Source)

{
    "question": "On which dataset is Model A trained? What's the scale of this dataset?",
    "anchor_pdf": ["Model A"],
    "reference_pdf": ["Dataset B"]
}

⚠️ Explanation While the question only mentions Model A, the answer depends on details from Dataset B, which is not named in the question. Therefore, Dataset B is placed in reference_pdf and must not be included in the prompt. Otherwise, the model would receive the answer indirectly.

For the four question types in AirQA, the structure of anchor_pdf, reference_pdf, and conference follows from the characteristics of each type. Specifically:

  • For single type, the question explicitly refers to one paper. Thus, anchor_pdf contains exactly one entry (the mentioned paper), and no reference_pdf is needed.
  • For multiple type, the question explicitly mentions multiple papers. Accordingly, anchor_pdf includes all mentioned papers (at least one), while reference_pdf may contain additional papers required to answer the question but not named in the prompt (as shown in the previous two examples).
  • For retrieval type, the question describes a paper without naming it. Here, anchor_pdf is empty; instead, conference specifies the search scope, and reference_pdf contains the target paper, which is the answer itself.
  • For comprehensive type, the question requires identifying relevant papers from a conference based on descriptions (similar to retrieval), and then reasoning over them. Therefore, no papers are directly provided in anchor_pdf; instead, the relevant papers are placed in reference_pdf, and the model must first retrieve them using conference and contextual hints before answering.
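The per-type rules above can be sketched as a small structural check (a hypothetical helper, not part of the repository):

```python
def check_paper_fields(example):
    """Check the per-type invariants on anchor_pdf / reference_pdf:
    - single: exactly one anchor, no references
    - multiple: at least one anchor (references optional)
    - retrieval / comprehensive: no anchors; the target paper(s)
      live in reference_pdf and must stay hidden from the prompt
    """
    tags = set(example["tags"])
    anchors = example.get("anchor_pdf", [])
    refs = example.get("reference_pdf", [])
    if "single" in tags:
        return len(anchors) == 1 and not refs
    if "multiple" in tags:
        return len(anchors) >= 1
    if "retrieval" in tags or "comprehensive" in tags:
        return not anchors and len(refs) >= 1
    return True
```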

Paper Metadata Format

The template of the paper metadata (data/metadata/*.json) is:

{
    "uuid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", // UUID
    "title": "This is the title of the paper.", // paper title
    "conference_full": "...", // full title of the conference
    "conference": "...", // conference abbreviation
    "year": 2026, // conference year, or the year the paper was published
    "volume": "...", // volume title
    "bibtex": "...", // bibtex citation text
    "authors": ["Author 1", "Author 2", ...], // authors list
    "num_pages": 10, // int value
    "pdf_url": "https://...", // URL to download the PDF, should end with .pdf
    "pdf_path": "data/papers/{conference}/{uuid}.pdf", // local path to save the PDF
    "abstract": "...", // paper abstract text
    "tldr": "...", // TLDR
    "tags": ["keyword", "tag"] // keywords or tags
}
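Since every file under data/metadata/ is one JSON object keyed by its UUID, a lookup index is straightforward to build. A minimal sketch (the helper name is ours; the glob pattern follows the path stated above):

```python
import glob
import json

def load_metadata_index(pattern="data/metadata/*.json"):
    """Index paper metadata by UUID for O(1) lookup,
    e.g. when resolving the UUIDs listed in anchor_pdf."""
    index = {}
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        index[meta["uuid"]] = meta
    return index
```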

Paper Processed Data Format

We use MinerU to extract key elements (e.g., text, tables, figures, and equations) from the raw PDFs. To facilitate future use, we provide the extracted content (data/processed_data/*.json) in the following JSON format:

{
    "tables": [
        {
            "table_caption": "Table 1: ...",
            "table_html": "<html><body><table>...</table></body></html>",
            "table_bbox": [0.0, 0.0, 100.0, 100.0],
            "page_number": 2
        },
        ... more tables ...
    ],
    "figures": [
        {
            "figure_caption": "Figure 1 ...",
            "figure_bbox": [10.0, 10.0, 50.0, 50.0],
            "page_number": 1
        },
        ... more figures ...
    ],
    "equations": [
        {
            "equation_text": "$$ ... $$",
            "page_number": 3
        },
        ... more equations ...
    ],
    "references": [
        {
            "reference_text": "..."
        },
        ... more references ...
    ],
    "TOC": [
        {
            "title": "Section 1", 
            "text": "section text",
            "level": 1,
            "page_number": 1,
            "page_numbers": [
                1
            ]
        },
        ... more text blocks ...
    ]
}
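As a usage sketch, assuming the JSON layout above, one can filter extracted elements by page, e.g. to collect the table captions on a given page (hypothetical helper, not part of the repository):

```python
def tables_on_page(processed, page_number):
    """Return the captions of all tables that MinerU extracted
    from the given page of the processed-data dict."""
    return [t["table_caption"]
            for t in processed.get("tables", [])
            if t.get("page_number") == page_number]
```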

⚠️ Note These extracted results are provided as a reference only and are not guaranteed to be perfect. If needed, you can always reprocess the original PDFs using your preferred toolchain.