How does the pipeline select which combinations of NASA datasets are interesting for hypothesis generation?

A heterogeneous graph neural network (GNN) was trained on historical co-usage patterns of datasets in the literature. The GNN ranks candidate pairings by the estimated probability that the two datasets, used together, would support a meaningful analysis. Only the top-ranked pairs enter the LLM pipeline.

Can a single LLM judge be trusted to assess the quality of generated hypotheses?

No. A key finding of the paper is that absolute hypothesis scores vary significantly depending on which judge model evaluates them, while the relative ranking remains somewhat consistent. The authors conclude that single-judge LLM evaluation is unreliable and recommend relying on multiple metrics and multiple judges instead.

Which scientific domains are covered by the generated hypotheses?

The pipeline produced hypotheses in ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry, which reflects the breadth of the NASA datasets it starts from.

EO-Agents: 160 NASA Hypotheses via Three-Agent LLM Pipeline

Researchers developed a three-agent LLM pipeline that uses a NASA Earth Observation Knowledge Graph and a GNN to rank dataset pairs, then automatically generates research hypotheses across glaciology, vegetation phenology, and other domains.

At the ICML 2026 AI for Science Workshop, Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian, and Peng Wei presented a system that combines a graph neural network with a three-agent LLM pipeline to automatically generate research hypotheses from NASA’s Earth observation datasets.

From Knowledge Graph to Hypothesis

The starting point of the pipeline is the NASA Earth Observation Knowledge Graph, a structured database covering 1,475 NASA datasets from domains as diverse as glaciology, ecohydrology, stratospheric chemistry, and vegetation phenology.

The sheer size of this space makes manual exploration impractical. The number of possible dataset pairs grows quadratically with the number of datasets: 1,475 datasets already yield more than a million unordered pairs, and no researcher has time to consider every combination. This is where a heterogeneous graph neural network (GNN) comes in. It is trained on historical co-usage patterns of datasets in the scientific literature, that is, on pairs that have already proven fruitful in published work. The GNN ranks candidate pairings by the estimated probability that they could together lead to meaningful analysis, and only the top-ranked pairings enter the LLM pipeline.
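The ranking step can be illustrated with a minimal sketch. The article does not describe the GNN's interface, so `toy_score` below is a hypothetical stand-in (keyword overlap) for the trained model, and the dataset names and keywords are invented for illustration only. The sketch only shows the shape of the step: score every unordered pair, keep the top k.

```python
from itertools import combinations
from math import comb
import heapq


def top_k_pairs(dataset_ids, score_pair, k):
    """Return the k highest-scoring unordered dataset pairs as (score, pair)."""
    scored = ((score_pair(a, b), (a, b)) for a, b in combinations(dataset_ids, 2))
    return heapq.nlargest(k, scored, key=lambda item: item[0])


# Hypothetical toy catalogue; these names and keywords are NOT from the paper.
TOY_KEYWORDS = {
    "ice_sheet_elevation": {"ice", "elevation", "polar"},
    "sea_ice_extent": {"ice", "ocean", "polar"},
    "leaf_area_index": {"vegetation", "land"},
    "soil_moisture": {"land", "water"},
}


def toy_score(a, b):
    """Stand-in for the GNN score: Jaccard overlap of keyword sets."""
    ka, kb = TOY_KEYWORDS[a], TOY_KEYWORDS[b]
    return len(ka & kb) / len(ka | kb)


ranked = top_k_pairs(sorted(TOY_KEYWORDS), toy_score, k=2)
for score, pair in ranked:
    print(f"{score:.2f}  {pair[0]} + {pair[1]}")

# Size of the full search space for 1,475 datasets: n * (n - 1) / 2.
n_pairs_full_graph = comb(1475, 2)
print(n_pairs_full_graph)  # 1087075
```

Using `heapq.nlargest` over a generator avoids materializing all pairs in memory, which matters once the candidate space reaches the million-pair scale.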

Three-Agent Pipeline: Filter, Generate, Evaluate

The architecture of the LLM portion follows a logical division of responsibilities. The filtering agent receives ranked dataset pairs and discards those that fall below a minimum level of thematic coherence. The generation agent formulates a structured research hypothesis for each remaining pair, describing which phenomena the combination of those datasets might explain, by what methodology, and what contribution it could make. The evaluator agent scores each hypothesis and provides feedback that can trigger revision.
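The division of responsibilities can be sketched as a simple control loop. This is an assumption-laden outline, not the authors' implementation: the three agents are passed in as plain callables, and the acceptance threshold, the revision limit, and the `demo_*` stubs are all hypothetical. In a real system each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]


@dataclass
class Hypothesis:
    pair: Pair
    text: str
    score: float = 0.0
    revisions: int = 0


def run_pipeline(
    pairs: List[Pair],
    is_coherent: Callable[[Pair], bool],                # filtering agent
    generate: Callable[[Pair, Optional[str]], str],     # generation agent
    evaluate: Callable[[str], Tuple[float, str]],       # evaluator agent
    accept_score: float = 0.7,   # assumed threshold, not from the paper
    max_revisions: int = 2,      # assumed limit, not from the paper
) -> List[Hypothesis]:
    results = []
    for pair in pairs:
        if not is_coherent(pair):
            continue  # filter: drop thematically incoherent pairs
        hyp = Hypothesis(pair=pair, text=generate(pair, None))
        hyp.score, feedback = evaluate(hyp.text)
        # Evaluator feedback triggers revision until accepted or out of budget.
        while hyp.score < accept_score and hyp.revisions < max_revisions:
            hyp.text = generate(pair, feedback)
            hyp.revisions += 1
            hyp.score, feedback = evaluate(hyp.text)
        results.append(hyp)
    return results


# Hypothetical stubs standing in for the three LLM agents.
def demo_is_coherent(pair: Pair) -> bool:
    return pair[0] != pair[1]


def demo_generate(pair: Pair, feedback: Optional[str]) -> str:
    base = f"Combining {pair[0]} and {pair[1]} may explain a shared signal"
    return base + " (revised: method specified)" if feedback else base


def demo_evaluate(text: str) -> Tuple[float, str]:
    if "revised" in text:
        return 0.8, ""
    return 0.5, "Specify the methodology."


results = run_pipeline(
    [("sea_ice_extent", "ice_sheet_elevation"), ("soil_moisture", "soil_moisture")],
    demo_is_coherent, demo_generate, demo_evaluate,
)
for h in results:
    print(h.pair, h.score, h.revisions)
```

Passing the agents in as callables keeps the control flow testable without any model access; the second toy pair is dropped by the filter, and the first is accepted after one revision.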

Applied to 1,475 NASA datasets, the pipeline produced 160 scientific hypotheses distributed across domains including glaciology, vegetation phenology, ecohydrology, aerosol-cloud interactions, and stratospheric chemistry.

Did the System Actually Discover Something New?

The central evaluation question is how to measure the quality of automatically generated hypotheses. The authors used expert evaluators who compared model-predicted novel pairings against “ground truth” pairings that had actually appeared in the literature but were withheld from GNN training.

Result: model-predicted novel pairings were rated as “nearly as compelling” as actual co-usage pairs from the literature. This suggests the GNN successfully captures meaningful structure in the dataset space, rather than merely learning surface correlations.

A Warning About Single-Judge Evaluation

Perhaps the most important methodological finding of the paper concerns not how hypotheses are generated but how they are evaluated. In a factorial experiment, the authors compared different LLMs serving as judges and discovered a concerning pattern: while the relative ranking of hypotheses remains somewhat consistent across models, absolute scores vary significantly depending on which judge model does the evaluating.

This reinforces a broader concern in the ML community: when a single LLM is used as the sole judge in evaluation, results are biased toward the characteristics of that model, toward what it considers a “good hypothesis.” The authors conclude that reliable evaluation requires multiple metrics and multiple judges, rather than a single-judge approach. This methodological caveat is not a side note. The authors present it as a contribution on par with the hypothesis generation pipeline itself.
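The difference between agreeing on ranking and agreeing on absolute scores can be made concrete with a small sketch. The scores below are invented toy numbers, not results from the paper, and the two "judges" are hypothetical. The code compares them with Spearman rank correlation (agreement on ordering) and with the gap between their mean scores (agreement on scale).

```python
from statistics import mean


def ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy scores for four hypotheses from two hypothetical judge models.
# These numbers are illustrative only and do not come from the paper.
judge_a = [6.0, 7.0, 8.0, 9.0]
judge_b = [2.0, 3.5, 3.0, 5.0]

rank_agreement = spearman(judge_a, judge_b)
mean_gap = mean(judge_a) - mean(judge_b)

print(f"Spearman rank correlation: {rank_agreement:.2f}")  # 0.80
print(f"Gap between mean scores:   {mean_gap:.3f}")        # 4.125
```

In this toy case the judges largely agree on which hypotheses are better, yet one scores about four points higher on average, so a fixed acceptance threshold would behave very differently depending on the judge.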

Why Automating Dataset Pairing Is Valuable

The space EO-Agents covers is far from trivial. NASA’s datasets come from different instruments, time spans, and spatial resolutions, and include satellite ocean temperature data, ice cover imagery, and vegetation spectral data. A researcher specializing in one domain may never become aware of datasets in another domain that could enrich or validate their analysis. A GNN that learns from co-usage patterns provides that cross-domain visibility automatically.

Scope of Application

The paper was accepted at the ICML 2026 AI for Science Workshop, signaling relevance to the community investigating LLM applications in scientific disciplines. However, the system in its current form generates hypotheses; it does not verify them. Every generated hypothesis still requires human expertise to assess its feasibility and to validate it against data.

For institutions like NASA managing thousands of heterogeneous datasets, such a system can be a valuable tool for uncovering unnoticed connections between datasets that have previously been isolated within separate research communities.
