Tether has released QVAC Genesis II, a major expansion of its open synthetic educational dataset designed for training artificial intelligence models. The update increases the total size of the dataset to 148 billion tokens, positioning it as one of the largest publicly available synthetic datasets focused on structured learning.
The dataset is developed by QVAC, Tether’s AI research arm, which focuses on producing high-quality, non-scraped data for model training. According to the company, the new release adds 107 billion tokens to the roughly 41 billion in the original Genesis I dataset.
QVAC Genesis II expands coverage from nine to 19 educational domains. Newly added areas include chemistry, computer science, machine learning, statistics, econometrics, astronomy, geography, and electrical engineering, alongside an expanded college-level physics section.
Focus on reasoning and transparency
The new dataset also introduces an “option-level reasoning” methodology. This approach systematically explains why each possible answer in a multiple-choice question is correct or incorrect, aiming to improve reasoning depth and reduce shortcut learning in large language models.
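To make the idea concrete, the sketch below shows one way a multiple-choice record with option-level reasoning could be represented and flattened into training text. It is an illustration only: the class names, field names, output layout and the "exactly one correct option" rule are assumptions made for this example, not QVAC's published schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OptionRationale:
    """One answer option plus the reason it is right or wrong (hypothetical schema)."""
    label: str
    text: str
    is_correct: bool
    rationale: str


@dataclass
class MultipleChoiceItem:
    """A question whose every option carries its own explanation (hypothetical schema)."""
    question: str
    options: List[OptionRationale]

    def validate(self) -> None:
        # Assumption for this sketch: exactly one option is correct.
        correct = [o for o in self.options if o.is_correct]
        if len(correct) != 1:
            raise ValueError("expected exactly one correct option")
        # Option-level reasoning means no option may be left unexplained.
        missing = [o.label for o in self.options if not o.rationale.strip()]
        if missing:
            raise ValueError(f"options without a rationale: {missing}")

    def to_training_text(self) -> str:
        # Flatten the record: question, options, per-option verdicts, final answer.
        lines = [f"Question: {self.question}"]
        for o in self.options:
            lines.append(f"{o.label}. {o.text}")
        lines.append("Reasoning:")
        for o in self.options:
            verdict = "correct" if o.is_correct else "incorrect"
            lines.append(f"{o.label} is {verdict}: {o.rationale}")
        answer = next(o.label for o in self.options if o.is_correct)
        lines.append(f"Answer: {answer}")
        return "\n".join(lines)


example = MultipleChoiceItem(
    question="Which SI unit measures electric current?",
    options=[
        OptionRationale("A", "ampere", True, "The ampere is the SI base unit of electric current."),
        OptionRationale("B", "volt", False, "The volt measures electric potential difference."),
        OptionRationale("C", "ohm", False, "The ohm measures electrical resistance."),
        OptionRationale("D", "watt", False, "The watt measures power."),
    ],
)

if __name__ == "__main__":
    example.validate()
    print(example.to_training_text())
```

Because every distractor carries its own explanation, a model trained on text in this shape sees why the wrong answers fail as well as which answer is right, which is the behaviour the methodology is described as targeting.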
Tether said the dataset is intended to support researchers, educators, and developers seeking transparent and reproducible AI training data. The company stressed that the synthetic nature of the data avoids copyright concerns associated with web-scraped corpora.
QVAC Genesis II is released under a Creative Commons Attribution-NonCommercial license and is available through open AI research platforms. The move aligns with a broader industry push toward open datasets as regulators and policymakers scrutinise how AI models are trained and governed.
For Europe, the release highlights how crypto and fintech firms are increasingly investing in foundational AI infrastructure, an area that is drawing growing attention from EU regulators as artificial intelligence becomes more embedded in financial services and payments.
