🎯 TARGET: Benchmarking Table Retrieval for Generative Tasks

UC Berkeley
*Now at CWI

Overview of the TARGET benchmark.

Abstract

The data landscape is rich with structured data, often of high value to organizations, that drives important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those that leverage text-to-SQL. Grounding these interactions, including conversational and agentic ones, in structured data through retrieval-augmented generation can provide substantial benefits in the freshness, accuracy, and comprehensiveness of answers. The key question, however, is: how do we retrieve the right table(s) for the analytical query or task at hand? To investigate this question, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. We use TARGET to analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream generators for question answering, fact verification, and text-to-SQL. We find that out-of-the-box embedding-based retrievers far outperform a BM25 baseline, which appears less effective for tables than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers to table metadata (e.g., missing table titles) and illustrate a stark variation of retrieval performance across datasets and tasks. TARGET is developed for easy reuse and extension to advance research on retrieval methods and pipelines for relational data through fine-grained, comprehensive, and consistent evaluation.

Key findings (initial)

Main results of the TARGET benchmark for various retrievers, tasks, and datasets.
Retrieval
We find that lexical methods based on BM25 and TF-IDF are less effective for table retrieval than they are for unstructured text, even with increased k. The high performance of these methods on the OTTQA dataset appears to be mainly driven by the high correspondence between Wikipedia table titles and queries, as performance drops when the title is left out (Table 2). We observe a similar pattern for these methods on the text-to-SQL tasks when table names are not included, which is further confirmed by results on FeTaQA, where the table titles are not descriptive and including them does not enhance performance. These findings emphasize the potentially critical role of table metadata. Embeddings of table headers and rows generally yield the best performance. LLM-generated table summaries via LlamaIndex result in lower retrieval performance and efficiency than the direct table-embedding pipeline, but generating descriptive table titles in place of non-descriptive ones (e.g., for FeTaQA) can enhance retrieval performance. For both text-to-SQL datasets, including data rows in the embedding actually lowers retrieval performance.
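
To make the difference between these retrieval pipelines concrete, below is a minimal, hypothetical sketch of lexical versus dense table retrieval. It is not the TARGET implementation: the serialization format (title, header, and a few sample rows flattened into text), the toy tables, and the embedding model (all-MiniLM-L6-v2) are assumptions chosen purely for illustration.

# Illustrative sketch only (not TARGET's code): lexical vs. dense retrieval
# over tables serialized as text.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def serialize_table(title, header, rows, num_rows=3):
    """Flatten a table into text: optional title, header, and a few sample rows."""
    lines = [title] if title else []
    lines.append(" | ".join(header))
    lines.extend(" | ".join(map(str, r)) for r in rows[:num_rows])
    return "\n".join(lines)

# Toy corpus of two tables; t2 has a missing (non-descriptive) title.
tables = {
    "t1": serialize_table("2012 Summer Olympics medal table",
                          ["Rank", "Country", "Gold"], [["1", "USA", "46"]]),
    "t2": serialize_table("", ["player", "team", "points"], [["A. Smith", "Lakers", "31"]]),
}
query = "Which country won the most gold medals at the 2012 Olympics?"
corpus_ids = list(tables)

# Lexical retrieval: BM25 over whitespace tokens of the serialized tables.
bm25 = BM25Okapi([tables[i].lower().split() for i in corpus_ids])
bm25_scores = bm25.get_scores(query.lower().split())

# Dense retrieval: embed the serialized tables and rank by cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
table_emb = model.encode([tables[i] for i in corpus_ids], convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
dense_scores = util.cos_sim(query_emb, table_emb)[0]

print("BM25 ranking:", sorted(zip(corpus_ids, bm25_scores), key=lambda x: -x[1]))
print("Dense ranking:", sorted(zip(corpus_ids, dense_scores.tolist()), key=lambda x: -x[1]))

Dropping the title string from serialize_table (or the header, or the sampled rows) mimics the metadata ablations discussed above, e.g., retrieval without table titles or without data rows.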

Generation
Unsurprisingly, we observe that providing database schemas for text-to-SQL is critical for generating accurate SQL queries, as the No Context baseline yields an accuracy of 0. Also notable is the low performance of all retrievers on the OTTQA dataset, which we hypothesize is due to the relatively short ground-truth answers in OTTQA compared to the longer generated answers, despite prompting for conciseness. Overall, we find that dense embeddings yield better retrieval performance. Notably, for the fact verification task, the precision and recall with OpenAI embeddings are significantly higher than when evaluating the statements without context, i.e., using only the memory of the LLM, underlining the value of grounding LLM conversations in factual structured data. When we exclude all “not enough information” responses, we find that the recall across all retrievers increases to approximately 0.747, which confirms the impact of incorporating relevant tables into the context.
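
The recall comparison above hinges on how “not enough information” responses are treated. The short sketch below illustrates the idea; the response labels (“supported”, “refuted”, “not enough information”) and the data are assumptions for the example, not TARGET's evaluation code.

# Minimal sketch (assumed response format, not TARGET's evaluator):
# fact-verification recall with and without "not enough information" responses.
def recall(predictions, labels, positive="supported"):
    """Fraction of positive-labeled statements that were predicted as positive."""
    tp = sum(1 for p, l in zip(predictions, labels) if l == positive and p == positive)
    fn = sum(1 for p, l in zip(predictions, labels) if l == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

predictions = ["supported", "not enough information", "refuted", "supported"]
labels      = ["supported", "supported",              "refuted", "supported"]

# Standard recall: an abstention on a supported statement counts as a miss.
print(recall(predictions, labels))

# Recall after excluding "not enough information" responses, as in the analysis above.
kept = [(p, l) for p, l in zip(predictions, labels) if p != "not enough information"]
print(recall([p for p, _ in kept], [l for _, l in kept]))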

Contact

We warmly welcome contributions and suggestions for TARGET; please find instructions at https://github.com/target-benchmark/target. Want to share or discuss something else? Please reach out to Madelon Hulsebos (madelon@cwi.nl)!

Citation

@inproceedings{ji2024target,
  title={TARGET: Benchmarking Table Retrieval for Generative Tasks},
  author={Ji, Xingyu and Parameswaran, Aditya and Hulsebos, Madelon},
  booktitle={NeurIPS 2024 Third Table Representation Learning Workshop},
  year={2024}
}