✨ TL;DR
This paper introduces the first comprehensive dataset suite covering the complete lifecycle of decentralized prediction markets on Polymarket, integrating fragmented on-chain and off-chain data from October 2020 to March 2026. The unified dataset enables systematic analysis of market dynamics and supports downstream applications like outcome calibration and economic expectation reconstruction.
Prediction markets provide valuable signals of collective beliefs about future events, but on decentralized platforms like Polymarket the data are fragmented across heterogeneous off-chain and on-chain sources. The market lifecycle spans multiple stages (creation, token registration, trading, oracle interaction, dispute, and settlement), yet no unified dataset captures all of them. This fragmentation makes large-scale analysis of market behavior difficult and limits the use of prediction market data in downstream applications.
The authors built a unified relational data system that integrates three canonical layers: market metadata, fill-level trading records, and oracle-resolution events. They addressed key technical challenges through identifier resolution (linking records across sources), on-chain recovery (reconstructing missing data from blockchain), and incremental updates (maintaining synchronization). The system was designed to be reproducible and extensible, with explicit consistency mechanisms to ensure data quality across the integrated sources.
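To make the identifier-resolution step concrete, here is a minimal sketch of linking the three layers (market metadata, fills, oracle events) through shared identifiers. The field names (`market_id`, `condition_id`, `token_ids`, `token_id`) and the toy records are assumptions chosen for illustration; they are not the paper's actual schema, and the authors' pipeline may work differently.

```python
# Toy records for the three layers. All field names and values are
# hypothetical and exist only to illustrate the linking idea.
markets = [
    {"market_id": "m1", "condition_id": "0xc1", "token_ids": ["t1", "t2"]},
    {"market_id": "m2", "condition_id": "0xc2", "token_ids": ["t3", "t4"]},
]
fills = [
    {"token_id": "t1", "price": 0.62, "size": 100.0},
    {"token_id": "t3", "price": 0.40, "size": 50.0},
    {"token_id": "t9", "price": 0.10, "size": 5.0},  # token with no known market
]
oracle_events = [
    {"condition_id": "0xc1", "event": "resolved"},
]


def link_layers(markets, fills, oracle_events):
    """Attach fills and oracle events to markets via shared identifiers."""
    # Lookup tables: outcome token -> market, condition -> market.
    token_to_market = {
        tok: m["market_id"] for m in markets for tok in m["token_ids"]
    }
    condition_to_market = {m["condition_id"]: m["market_id"] for m in markets}

    linked = {m["market_id"]: {"fills": [], "oracle_events": []} for m in markets}
    unresolved = []  # fills whose token could not be mapped to any market

    for fill in fills:
        market_id = token_to_market.get(fill["token_id"])
        if market_id is None:
            unresolved.append(fill)
        else:
            linked[market_id]["fills"].append(fill)

    for event in oracle_events:
        market_id = condition_to_market.get(event["condition_id"])
        if market_id is not None:
            linked[market_id]["oracle_events"].append(event)

    return linked, unresolved


linked, unresolved = link_layers(markets, fills, oracle_events)
```

In this sketch, fills that cannot be resolved are collected separately rather than dropped, which is one plausible way to flag records that would need the kind of on-chain recovery the summary mentions.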
What the paper shows.
The resulting dataset comprises more than 770 thousand market records, over 943 million fill records, and nearly 2 million oracle events spanning October 2020 to March 2026. The authors demonstrate utility through descriptive analyses of market activity patterns and two case studies: NBA outcome calibration (showing how market prices predict sports outcomes) and CPI expectation reconstruction (extracting inflation expectations from market data). The system successfully maintains data consistency and reproducibility across continuous updates.
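As a rough illustration of what an outcome-calibration check involves, the sketch below bins market prices (read as implied probabilities) and compares each bin's average price with the observed win rate, then computes a Brier score. The prices and outcomes are synthetic, and the function names and binning scheme are assumptions; the paper's NBA analysis may use a different procedure.

```python
def calibration_table(prices, outcomes, n_bins=5):
    """Group prices into equal-width bins and compare them with outcomes."""
    bins = [[] for _ in range(n_bins)]
    for price, outcome in zip(prices, outcomes):
        # Clamp so that a price of exactly 1.0 falls in the last bin.
        idx = min(int(price * n_bins), n_bins - 1)
        bins[idx].append((price, outcome))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins
        mean_price = sum(p for p, _ in items) / len(items)
        win_rate = sum(y for _, y in items) / len(items)
        rows.append(
            {"bin": i, "n": len(items), "mean_price": mean_price, "win_rate": win_rate}
        )
    return rows


def brier_score(prices, outcomes):
    """Mean squared error between implied probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(prices, outcomes)) / len(prices)


# Synthetic data: pre-game prices and whether that side won (1) or lost (0).
prices = [0.10, 0.15, 0.50, 0.55, 0.90, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]

rows = calibration_table(prices, outcomes)
score = brier_score(prices, outcomes)
```

A well-calibrated market would show `win_rate` close to `mean_price` in each bin; a lower Brier score indicates prices that track outcomes more closely.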
The dataset covers only Polymarket, a single decentralized prediction market platform, so findings drawn from it may not generalize to other decentralized implementations or to centralized platforms. The paper does not discuss potential biases in market participation or coverage gaps in oracle resolution events. Temporal coverage begins in October 2020, so earlier market history may be missing. The paper also does not thoroughly address data quality issues, missing values, or the accuracy of the on-chain recovery mechanisms used for incomplete records.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.