🎯 TARGET: Benchmarking Table Retrieval for Generative Tasks

Xingyu Ji, Aditya Parameswaran, Madelon Hulsebos*
UC Berkeley; *now at CWI

Overview of the TARGET benchmark.

Abstract

The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Grounding these interactions, whether through conversational interfaces or agentic components, in structured data via retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective for table retrieval than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers to table metadata (e.g., missing table titles), and demonstrate a stark variation in retrieval performance across datasets and tasks.

The TARGET Benchmark

What is TARGET?
TARGET is the first benchmark for evaluating open-domain querying over tabular data, as illustrated in the figure below. TARGET enables consistent and comprehensive evaluation of models and pipelines for table retrieval in isolation, as well as end-to-end on downstream tasks (question answering, fact verification, and text-to-SQL). In our paper, we use TARGET to analyze retrieval methods based on sparse lexical representations, dense embeddings of metadata, dense table embeddings, and dense row embeddings.
The open-domain question answering pipeline over tabular data: the tables needed to answer a question are not provided and must first be retrieved.
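
To make "retrieval in isolation" concrete, below is a minimal sketch of how table retrieval could be scored with recall@k; the data layout and function name are illustrative assumptions, not the TARGET package API.

from typing import Dict, List

def recall_at_k(
    retrieved: Dict[str, List[str]],  # query id -> ranked list of retrieved table ids
    relevant: Dict[str, List[str]],   # query id -> ground-truth table ids
    k: int = 5,
) -> float:
    """Average over queries of the fraction of ground-truth tables found in the top-k results."""
    scores = []
    for qid, gold in relevant.items():
        top_k = set(retrieved.get(qid, [])[:k])
        scores.append(sum(t in top_k for t in gold) / len(gold) if gold else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: the single ground-truth table is ranked second, so recall@2 is 1.0.
print(recall_at_k({"q1": ["tbl_7", "tbl_3"]}, {"q1": ["tbl_3"]}, k=2))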


Why TARGET?
  • Existing systems and benchmarks for retrieval-augmented generation (RAG) pipelines mainly focus on retrieval of text, images, and audio data. The value of tabular data for RAG is overlooked, even though tables typically contain fresh, reliable, and domain-specific data. As we show in our paper, grounding LLM dialogues in tabular data significantly improves accuracy.
  • We connect LLM-powered analytical querying over structured data (e.g., through table reasoning or SQL generation) with a retriever, extending "closed-domain" querying systems to the "open-domain" setting.

  • The curated datasets and the TARGET Python package (see above links) are designed for easy reuse in custom RAG systems over tabular data; a minimal sketch of what a custom retriever could look like follows below.
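
As an illustration of plugging a custom retriever into a RAG pipeline over tables, here is a minimal sketch; the interface below is a hypothetical stand-in, not the actual abstraction defined by the TARGET package, so please consult the repository for the real API.

from abc import ABC, abstractmethod
from typing import Dict, List

class CustomTableRetriever(ABC):
    """Hypothetical interface: given a natural-language query, return ranked table identifiers."""

    @abstractmethod
    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        ...

class TitleKeywordRetriever(CustomTableRetriever):
    """Toy baseline: rank tables by keyword overlap between the query and the table title."""

    def __init__(self, titles: Dict[str, str]):  # table id -> table title
        self.titles = {tid: set(title.lower().split()) for tid, title in titles.items()}

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(self.titles, key=lambda tid: -len(words & self.titles[tid]))
        return ranked[:top_k]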

Key findings

Retrieval
We find that table retrieval based on sparse lexical representations such as BM25 (OTTQA) is less effective, across tasks and datasets, than it is for text retrieval. The importance of descriptive metadata for retrievers based on lexical representations is clear from their low performance on FeTaQA and TabFact, which do not contain descriptive metadata. LLM-generated table summaries with dense metadata embeddings can significantly improve retrieval performance, as illustrated by the Dense Metadata Embedding baseline. Dense Table Embeddings (table header + rows) generally yield the best performance. Notably, for both text-to-SQL datasets, the effect of including data rows is minimal, with differences within ±5% in recall. The Dense Row-level Embedding method exhibits performance comparable to dense embeddings of tables with sampled rows, but becomes impractical for large tables, such as those in BIRD. Further analysis regarding scaling, context limits, and the effect of retrieval can be found in the paper.

Retrieval results of the TARGET benchmark for various retrievers, tasks, and datasets.
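
The dense table embedding approach can be approximated in a few lines: serialize each table's title, header, and a few sampled rows into text, embed that text, and rank tables by cosine similarity to the query embedding. The sketch below uses sentence-transformers with an assumed general-purpose model; it illustrates the general technique rather than the exact configuration evaluated in the benchmark.

from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any general-purpose text embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

def serialize_table(title, header, rows, max_rows=3):
    """Flatten title, header, and up to max_rows sampled rows into one embeddable string."""
    lines = [title, " | ".join(header)]
    lines += [" | ".join(map(str, row)) for row in rows[:max_rows]]
    return "\n".join(lines)

tables = {
    "tbl_gdp": serialize_table("GDP by country", ["country", "year", "gdp_usd"],
                               [["Japan", 2022, "4.2T"], ["Brazil", 2022, "1.9T"]]),
    "tbl_oscars": serialize_table("Academy Award winners", ["year", "film", "category"],
                                  [[2020, "Parasite", "Best Picture"]]),
}

table_ids = list(tables)
table_embeddings = model.encode([tables[t] for t in table_ids], convert_to_tensor=True)

query = "Which country had the highest GDP in 2022?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, table_embeddings)[0]
ranked = [table_ids[int(i)] for i in scores.argsort(descending=True)]
print(ranked)  # expected to rank "tbl_gdp" first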

Generation
In general, the "No Context" baseline, which is not provided with any relevant tables, performs significantly worse, illustrating the value of table retrieval (e.g., from Wikipedia) for grounding LLMs. The low performance of all retrievers on the OTTQA dataset is notable; we hypothesize that it stems from OTTQA's relatively short reference answers compared to the longer generated answers (despite prompting for conciseness), and it illustrates the need for more robust evaluation metrics. Naturally, given the stronger retrieval performance of dense embeddings, we find that dense retrievers generally yield the best downstream performance across datasets. Meanwhile, the poor retrieval performance of sparse lexical representations on FeTaQA seems to distract the generator with irrelevant tables. To understand the limitations of LLM context for table comprehension tasks, we explore the relationship between the rank of the ground-truth table in the retrieval results and downstream task performance, and find that accuracy significantly decreases when the correct table is positioned lower in the context (more details are in the paper).

Generation results of the TARGET benchmark for various retrievers, tasks, and datasets.
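
To make the positioning effect concrete, here is a minimal sketch of how retrieved tables might be concatenated into a generator prompt in rank order; the prompt template is an assumption for illustration, not the exact format used in the benchmark.

def build_prompt(question, retrieved_tables):
    """Concatenate retrieved tables (already serialized to text) in rank order, then append the question.

    Tables ranked lower land deeper in the context, where the generator is more likely to overlook them.
    """
    parts = []
    for rank, (table_id, table_text) in enumerate(retrieved_tables, start=1):
        parts.append(f"[Table {rank}: {table_id}]\n{table_text}")
    parts.append(f"Question: {question}\nAnswer concisely using only the tables above.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Which country had the highest GDP in 2022?",
    [("tbl_oscars", "year | film | category\n2020 | Parasite | Best Picture"),
     ("tbl_gdp", "country | year | gdp_usd\nJapan | 2022 | 4.2T")],
)
print(prompt)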

Contact

We warmly welcome contributions and suggestions for TARGET; please find instructions at https://github.com/target-benchmark/target. Want to share or discuss something else? Please reach out to Madelon Hulsebos (madelon@cwi.nl)!

Citation

@inproceedings{ji2024target,
  title={TARGET: Benchmarking Table Retrieval for Generative Tasks},
  author={Ji, Xingyu and Parameswaran, Aditya and Hulsebos, Madelon},
  booktitle={NeurIPS 2024 Third Table Representation Learning Workshop},
  year={2024}
}