Foundation models (FMs), pre-trained on large-scale reanalysis datasets using self-supervised learning, offer a transformative path forward for global weather forecasting. This study presents a systematic, out-of-the-box evaluation of multiple FM architectures—including Transformer-based (Pangu, Prithvi, Aurora), Graph Neural Network-based (AIFS), and diffusion-based (GenCast) models—focusing on their ability to predict key atmospheric variables across multiple pressure levels and lead times. Our goal is to support Earth system scientists in responsibly integrating AI into research workflows by establishing a rigorous FM evaluation framework. We emphasize four pillars critical to the scientific adoption of FMs: adaptability (across sensors and tasks), accessibility (software availability and runtime efficiency), trust (interpretability and stable outputs), and validation (scientific benchmarking and reproducibility). All models were deployed on NASA’s Discover HPC system using pre-trained weights and sample-formatted inputs to assess accuracy, infrastructure readiness, and usability. Initial results show GenCast leading in mid-range forecasts for state variables such as geopotential height, winds, temperature, and specific humidity, albeit with tradeoffs: fewer output levels, 12-hour temporal resolution, and higher compute requirements. We compare all FM forecasts to GEOS-FP, NASA's operational weather forecast produced by the Global Modeling and Assimilation Office. This work establishes benchmark performance baselines for FM-based forecasting and sets expectations for their operational applicability. Our evaluation framework will be made available to NASA Goddard scientists in Summer 2025 to enable rapid testing of current and future models. Longer term, this effort contributes to a broader vision of accessible, trustworthy, and domain-adaptable foundation models for Earth system science.