Enabling data-driven modeling and real-time prediction of the dynamic solar atmosphere
Decadal Survey for Solar and Space Physics (Heliophysics) SSPH 2024-2033



SYNOPSIS:
Cutting-edge research on the region between the solar photosphere and the Alfvén surface, where the solar wind disconnects from the solar surface, increasingly relies on large-scale computational modeling. This is true not just for numerical experiments that only involve simulations, but also for interpreting the ever-more-intricate observations coming out of new facilities. A paradigm shift is currently underway in the field, away from ad hoc and ab initio models of the atmosphere and towards simulations that are directly driven by observations. While this topic is currently at the level of basic research into the techniques, the intent is for it to serve both basic research and operational space-weather needs. We must advance data-driven simulations and their supporting infrastructure to the point where community members can use time-dependent dynamic models on a regular basis as a tool to interpret observations and test physical theories. Because of the complexity of the task, the immense computational resources required, and the required longevity of such a project, we argue that this effort should be strategically funded. Such an effort is required to make the most of current and upcoming multi-mission, multi-instrument, heterogeneous observational data.
• Dedicated funding is required for collaborative model development that allows for continuous community use, support, and contribution, separate from the application of models to specific scientific analyses and papers.
• Data and computational resources should be hosted within the same facility to provide tightly integrated dynamic modeling, forward modeling, and data analysis. This facility needs to combine easy access to simulation and observational data with easy access to open-source codes that can be run as-is or modified as needed.
• The broad scope of data-driven methods requires the combination of 4π steradian spectropolarimetric observations for self-consistent global models and high temporal/spatial resolution, multi-height observations for detailed studies of local physical processes.

State of the art
Multi-wavelength observations in the solar photosphere, chromosphere, and lower solar corona provide the inner dynamic boundary condition for the heliosphere. Understanding the 3D dynamic state of the magnetic field and plasma parameters requires combining individual observations (typically constraining only a small fraction of the volume in a thin radial layer) with models that provide a volume-filling global context. The main open questions in the physics of the solar atmosphere are how the atmosphere above the temperature minimum region is heated, how and where the various types of solar wind arise, and what mechanisms trigger individual eruptions such as flares and coronal mass ejections (CMEs). All of them require knowledge of the 3D structure of the magnetic field and plasma properties above the photosphere. At present, that knowledge is severely limited. As just one demonstration, the current status of solar eruption forecasting makes clear that there is a massive overlap in the parameter spaces of eruptive and non-eruptive active regions [4,5]; this implies that we fundamentally do not know what the 3D magnetic or thermodynamic structure of the corona is and therefore have limited ability to assess its stability.
There are primarily two groups of models that simulate evolving coronal magnetic fields in 3D: quasi-static (or time-independent) and dynamic (time-dependent). In the quasi-static group, the potential, linear, and non-linear force-free field (NLFFF) extrapolations have been developed. These models apply the vacuum-limit assumption, in which magnetic pressure dominates the gas pressure (low-β regime). The second group, the time-dependent models, uses fewer approximations and comes in a variety of setups. Smaller-domain but high-resolution fluid models, such as Bifrost [6,7] and MURaM [8,9,10], have the most comprehensive physics and allow for forward modeling of observables from visible to EUV and soft X-ray wavelengths, but are computationally expensive. Large-scale models, including global fluid models [1], have lower resolution and simplified physics, but allow for modeling the evolution of individual eruptions originating within the global solar corona out to large radial distances from the Sun [11].
The ingestion of observations in all models is critical for the modeling of actual solar events. The research focus is currently moving rapidly from the domain of data-inspired models (where the initial setup mimics certain properties observed on the Sun) to data-driven models that rely on routine vector magnetic field observations from HMI/SDO and use new techniques to derive electric fields or plasma velocities. How to incorporate the increasing number of multi-height observations is at this point an open research topic. The future requires a focused community plan to make the most of the deluge of heterogeneous data coming from a number of sources, notably the NSF's new flagship ground-based visible and infrared Inouye Solar Telescope [12], the corona-specific CoMP [13] and COSMO [14] facilities, and radio observations from, e.g., ALMA [15] and EOVSA [16] (see also the FASR white paper [17]).
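For reference, the quasi-static model hierarchy named above can be written compactly. This is the standard textbook form in SI units, not a statement about how any particular cited code is implemented; the symbols $\alpha$ (force-free parameter), $p$ (gas pressure), and $\mu_0$ (vacuum permeability) are introduced here for illustration.

```latex
% Low-beta regime: gas pressure is negligible compared with magnetic pressure
\beta \equiv \frac{2\mu_0\, p}{B^2} \ll 1
\quad\Longrightarrow\quad
(\nabla\times\mathbf{B})\times\mathbf{B} \approx 0 .

% Force-free field: the current is parallel to the field
\nabla\times\mathbf{B} = \alpha\,\mathbf{B}, \qquad \nabla\cdot\mathbf{B} = 0 .

% Three cases of the hierarchy:
%   potential field:      \alpha = 0
%   linear force-free:    \alpha = \text{const}
%   nonlinear force-free: \alpha = \alpha(\mathbf{r}), with
\mathbf{B}\cdot\nabla\alpha = 0 .
```

The last condition follows from taking the divergence of $\nabla\times\mathbf{B} = \alpha\mathbf{B}$: in the nonlinear case $\alpha$ may vary from field line to field line but is constant along each one.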
Currently, magnetohydrodynamic (MHD) models of the coupled photosphere-corona atmosphere, such as those that study CME/flare events, typically derive boundary conditions for the magnetic field from the observed photospheric magnetograms and produce the pre-eruptive configuration using (1) some form of boundary driving of magnetic flux, such as flux emergence, shear flows, and helicity condensation [18,19,20,21], (2) nonlinear force-free field (NLFFF) extrapolations [22], or (3) analytical flux-rope models that are inserted into the source region of the eruption [11]. In many cases, the lower boundary driving and/or the construction of the force-free field and the inserted flux ropes are still largely ad hoc and not well constrained by either the observations or the physics of the driving layer. As a result, generally only qualitative agreement is obtained between the modeled magnetic field evolution and the observed event. Truly quantitative models of flare and CME events that are well constrained by observations are yet to be developed.
Another challenge is bridging the gap between physically required and numerically feasible resolution and cadence. Lately, two types of data-driven models have been developed for this purpose: magneto-friction (MF) [18] and MHD models [24]. The MF model assumes that the plasma velocity in the induction equation is proportional to the local Lorentz force; the subsequent evolution relaxes the magnetic configuration toward a force-free state. MF is more computationally efficient than MHD and is suitable for describing the slow quiescent evolution of active regions (ARs), but not for modeling eruptions. The MHD models explicitly solve the full set of MHD equations, including the plasma properties. The MHD approach is suitable for modeling the rapid evolution of ARs during eruptions but is currently too computationally expensive to model their long-term quiescent evolution. A hybrid framework, in which the MF model is used for quiescent periods of AR evolution and the MHD model is used for flaring periods, has recently been developed within the Coronal Global Evolutionary Model (CGEM [23], see Figure 1). Other global approaches use locally concentrated or adaptively refined grids (e.g., the SWMF [25]).
The future lies in data-driven models that can be implemented through: (1) boundary driving, the use of temporal sequences of photospheric electric fields (derived from vector magnetograms to represent the realistic flux transport) at the lower boundary of a time-dependent coronal field model (e.g., [26]); and (2) data assimilation, the use of temporal sequences of observations to update the physical state of a model through statistical methods such as ensemble Kalman filters (EnKF).
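The magneto-frictional assumption described above can be made concrete with one commonly used formulation. The normalization by $B^2$ and the friction coefficient $\nu$ are assumptions made here for illustration; individual MF codes may scale the velocity differently.

```latex
% Current density from Ampere's law (SI units)
\mathbf{J} = \frac{1}{\mu_0}\,\nabla\times\mathbf{B}

% Magneto-frictional closure: velocity proportional to the Lorentz force
\mathbf{v} = \frac{1}{\nu}\,\frac{\mathbf{J}\times\mathbf{B}}{B^{2}}

% Ideal induction equation advanced with this velocity
\frac{\partial\mathbf{B}}{\partial t} = \nabla\times\left(\mathbf{v}\times\mathbf{B}\right)
```

Because $\mathbf{v}$ vanishes wherever $\mathbf{J}\times\mathbf{B} = 0$, the evolution stalls once a force-free state is reached. The momentum and energy equations are never solved, which is why the approach is inexpensive and also why it cannot follow the dynamics of an eruption.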
Data assimilation (DA) is widely used and well established in the Earth atmospheric community (e.g., to forecast the weather); however, its use in solar physics is currently limited to solar-cycle forecasting [27,28]. The full implementation of DA (through EnKF) in MHD models is a step beyond boundary driving and provides the following advantages: (1) evolution of an ensemble of models that can account for observation uncertainties and calculate model errors; (2) correction of the full model state in response to new observations; and (3) the ability to assimilate a wide variety of observations, including remote sensing and in-situ observations, not limited to just the lower boundary of the system. However, EnKF DA entails substantial computing costs because it requires (1) an adequate number of ensemble runs and (2) forward modeling of observables from the physical model outputs, so that the model can be compared with real observations at every assimilation step.
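To illustrate the EnKF update discussed above, the following is a minimal sketch of a single stochastic (perturbed-observation) analysis step. It assumes a linear observation operator, which real forward models of spectropolarimetric observables are not; the function name `enkf_analysis` and all variable names are hypothetical and do not come from any cited code.

```python
import numpy as np


def enkf_analysis(ensemble, obs, obs_operator, obs_cov, rng):
    """One stochastic EnKF analysis step (illustrative sketch only).

    ensemble     : (n_state, n_members) array of forecast states
    obs          : (n_obs,) observation vector
    obs_operator : (n_obs, n_state) linear forward operator H (an assumption)
    obs_cov      : (n_obs, n_obs) observation-error covariance R
    rng          : numpy.random.Generator used to perturb the observations
    """
    n_members = ensemble.shape[1]

    # Ensemble anomalies about the forecast mean
    anomalies = ensemble - ensemble.mean(axis=1, keepdims=True)

    # Forward-model every member into observation space
    predicted = obs_operator @ ensemble
    predicted_anom = predicted - predicted.mean(axis=1, keepdims=True)

    # Sample covariances estimated from the ensemble
    cov_xy = anomalies @ predicted_anom.T / (n_members - 1)
    cov_yy = predicted_anom @ predicted_anom.T / (n_members - 1)

    # Kalman gain K = cov_xy (cov_yy + R)^-1, computed via a linear solve
    gain = np.linalg.solve((cov_yy + obs_cov).T, cov_xy.T).T

    # Each member sees the observation plus noise drawn from R
    noise = rng.multivariate_normal(np.zeros(obs.size), obs_cov, size=n_members).T
    perturbed_obs = obs[:, None] + noise

    # Correct the full state of every member, not just the observed part
    return ensemble + gain @ (perturbed_obs - predicted)


# Toy usage: three state variables, only the first one is observed
rng = np.random.default_rng(0)
forecast = rng.normal(size=(3, 200))
H = np.array([[1.0, 0.0, 0.0]])
R = np.array([[0.01]])
analysis = enkf_analysis(forecast, np.array([2.0]), H, R, rng)
```

The two cost drivers named in the text appear directly in the sketch: the number of columns in `ensemble` (each one a full model run) and the product `obs_operator @ ensemble`, which stands in for forward modeling every member at every assimilation step.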

Current roadblocks and how to clear them
In the next decade we must support the development of time-dependent data-driven simulations using physically consistent boundary conditions and in-the-volume assimilation that treat both the magnetic and plasma variables. Data-driven MHD following a general radiative fluid or multi-fluid approach is necessary because:
(i) Static extrapolations lack sufficient physics and constraints to be correct, in the sense that an inferred coronal state cannot confidently be stated to be near the true state. Such extrapolations cannot distinguish between, for instance, proposed theories for solar eruptions. Further, time series of extrapolations are not causally connected and therefore permit changes in the 3D state from one time to the next that no physical process allows.
(ii) Magneto-frictional methods also lack correct physics, i.e., they omit the material plasma and include non-physical evolution, even though they do provide a major improvement on static models via built-in hysteresis.
(iii) Spectropolarimetric data require sophisticated inversions to infer the 3D magnetic and plasma structure in both the lower atmosphere [29] and the corona [30]. Currently, inversions almost exclusively assume independent 1D radiative transfer problems without reference to or knowledge of dynamically consistent 3D models. While recent efforts try to include spatial coupling [31] or infer plasma parameters on a spatial rather than optical depth grid [32,33], in practice, performing an inversion will always be a highly model-dependent problem from two different perspectives: first, it depends upon physical models that allow inversion of the combined diagnostics from broadband emission and polarized spectra to recover the physical characteristics of the emitting plasma; and second, computational extrapolation models must be used to fill in gaps in the observational data: spatially, temporally, and in terms of sensitivity to each plasma property (magnetic field, density, temperature, ionization degree, etc.). Data-driven simulations, which are dynamically constrained by the underlying equations of motion, are a powerful tool to drastically reduce the uncertainties in both types of model dependencies.
(iv) Currently, the inner boundary conditions and initial state of both global forecasting models and smaller-scale detailed models of individual events are mostly ad hoc. For example, CMEs inserted into a background solar wind to simulate space weather events are either initially unstable or are driven in an ad hoc manner until they erupt [11]. Instead, data-driving provides a scaffolding on which to hang multi-scale studies, which must accurately capture and exchange the effects of physical processes occurring at each of the scales in the multi-scale simulation suite, e.g., providing the first layer of the linked series of models in the white paper by Allred et al. [34]. Physically consistent boundary conditions capture physics at larger scales outside each subset of the simulation suite, and data assimilation captures physics at sub-grid scales. Having dynamical, data-constrained models will provide realistic inner boundary conditions for the heliosphere and initial conditions for more detailed local models.
To move beyond the limitations just described, time-dependent data-driven models are required, but they are only possible given the confluence of three critical elements:
• The availability of data (both remote sensing and in-situ) with a stable quality, time-duration, spatial coverage, and cadence, that constrain enough of the MHD state vector;
• MHD models with sufficient sophistication in terms of physics and their ability to simulate a solar-like setup in terms of domain extent and time-scales;
• The ability to ingest these data into models and to continuously update and correct the model state to reflect the observed conditions on the Sun.

Data Requirements, Observing Facility requirements
We need data that are capable of both feeding simulations and assessing their performance. The recent progress in data-driven models has relied critically on routine vector magnetic field observations from HMI/SDO and new approaches to derive electric fields (or plasma velocities) from these observations. The need for such observations will only increase once full data assimilation approaches are incorporated into MHD models, and new methods must also include better constraints on the thermodynamic variables; uncertainties, in particular systematics in current and future observations, will need to be better quantified. Future routine observations of vector magnetic fields in the chromosphere and transition region (e.g., ngGONG [35]) can provide the much-needed additional constraints within the volume, particularly of the pre-eruption chromospheric and coronal magnetic field [36]. Global operational models require continuous observations of the vector magnetic field and at least one thermodynamic variable over the whole 4π surface of the Sun. Future space missions need to focus on multi-spacecraft constellations mapping out a larger area of the heliosphere [37].
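The link between magnetogram sequences and driving electric fields is Faraday's law, whose vertical component on the photosphere reads ∂B_z/∂t = −(∂E_y/∂x − ∂E_x/∂y) in SI units. The sketch below shows only this consistency check on a Cartesian patch, assuming arrays indexed as [y, x]; it is not an electric-field inversion, and the function name `dbz_dt_from_efield` is hypothetical.

```python
import numpy as np


def dbz_dt_from_efield(e_x, e_y, dx, dy):
    """Rate of change of B_z implied by a horizontal electric field.

    Implements the vertical component of Faraday's law,
    dBz/dt = -(dEy/dx - dEx/dy), with centered finite differences.
    Arrays are indexed [y, x]; dx and dy are the uniform grid spacings.
    """
    dey_dx = np.gradient(e_y, dx, axis=1)  # derivative along x
    dex_dy = np.gradient(e_x, dy, axis=0)  # derivative along y
    return -(dey_dx - dex_dy)


# Toy usage with an analytic field: E_x = y, E_y = sin(x)
x = np.linspace(0.0, 2.0 * np.pi, 201)
y = np.linspace(0.0, 1.0, 50)
X, Y = np.meshgrid(x, y)
rate = dbz_dt_from_efield(Y, np.sin(X), x[1] - x[0], y[1] - y[0])
# Analytically, rate = 1 - cos(x) for this choice of E
```

A check of this kind constrains only the curl of the horizontal electric field. The remaining freedom, a gradient of a scalar potential, is one reason additional information such as Doppler or horizontal flow measurements is needed to pin down the driver.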

Model requirements
• Improving models such that large-scale simulations of the solar atmosphere allow for detailed modeling of processes and the forward modeling of observables through a combination of implemented physics and numerical resolution. Forward modeling of spectropolarimetric observables is often done as a post-processing step that is poorly integrated with MHD models and is a large bottleneck for individual researchers. This bottleneck concerns both access to extensive MHD datasets for a broad swath of the community and the computational resources required for the most expensive and diagnostic-rich forward models. The FORWARD [38] project was an early example of building the forward modeling side of this endeavor, but it is not directly coupled with a broad variety of simulations.
• The full implementation of data assimilation to ingest a wide range of remote sensing and in-situ observations from heliospheric observatories. Unlike boundary-driven models, data assimilation can ingest a multitude of simultaneous observations within the entire simulation volume, account for observation uncertainties, and calculate model errors. While data assimilation is widely used in models of Earth's atmosphere, it has rarely been explored and applied in the context of the solar atmosphere.
• The use of accelerator technologies such as GPUs and physics parameterization utilizing machine learning to boost computation speed. This will allow models to run faster than real time in order to enable research on a large number of observed events and allow for operational space weather modeling. These developments are only possible if the field widely adopts the latest computing technologies and stays at the forefront of new developments. While some models have been refactored for GPU use (e.g., MAS [39]) or are in the process of refactoring (e.g., MURaM [40]), the field of solar physics overall is already a decade behind the curve in adopting GPU computing and other emerging exascale technologies. We need targeted support for this transition in a way that allows continued in-community development of new codes through a culture change in training software engineers and domain scientists in the use of exascale technology.
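As a simple illustration of the forward-modeling step named in the first requirement, the sketch below computes an optically thin emission proxy by integrating n_e² G(T) along the line of sight of a simulation cube. The Gaussian temperature response and its default parameters are placeholders chosen for illustration, not an actual instrument response, and the function name `emission_proxy` is hypothetical.

```python
import numpy as np


def emission_proxy(density, temperature, dz, log_t_peak=7.0, log_t_width=0.15):
    """Optically thin emission proxy integrated along axis 0.

    density     : electron number density cube, shape (nz, ny, nx)
    temperature : temperature cube in kelvin, same shape
    dz          : uniform cell size along the line of sight
    log_t_peak, log_t_width : parameters of a placeholder Gaussian response
                  in log10(T); NOT a real instrument response function
    """
    log_t = np.log10(temperature)
    response = np.exp(-0.5 * ((log_t - log_t_peak) / log_t_width) ** 2)
    # Emissivity scales as n_e^2 G(T); sum the cells along the line of sight
    return np.sum(density**2 * response, axis=0) * dz


# Toy usage: a uniform hot slab viewed along the first axis
density = np.full((10, 4, 4), 1.0e9)
temperature = np.full((10, 4, 4), 1.0e7)
image = emission_proxy(density, temperature, dz=1.0e8)
```

Even this simplest case touches every cell of the cube once per synthetic channel. Spectropolarimetric synthesis with full radiative transfer is far more expensive, which is why the text argues for co-locating simulation data with the computing resources used for forward modeling.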

The need for community coordination
Building a simulation framework like the one we propose is too great an effort for a single-PI, university-scale research program. There is broad interest in these tasks within the community, as evidenced by a number of recent workshops and conference sessions: "Model-Coupling workshop" (Boulder 2018); "Data-Driven Models of the Solar Progenitors of Space Weather and Space Climate" (Nagoya 2018); "Data-driven 3D Modeling of Evolving and Eruptive Solar Active Region Coronae" (ISSI 2022); and numerous sessions at the TESS, SPD, AGU, and COSPAR meetings. However, progress is currently somewhat piecemeal. What is lacking is a concerted, community-informed effort to decide the best way to implement the tight integration of data-driven modeling with observational data analysis. We therefore argue that this capability should be coordinated and supported at the national level.
• The primary development of many commonly used codes in solar physics is based in Europe (e.g., MHD codes such as Bifrost [6,7,43], MURaM [8,9,40,44], MPI-AMRVAC [45], MANCHA3D [46], CO5BOLD [47], and inversion codes such as STiC [48], NICOLE [49], HAZEL [50], DeSIRe [51]). The European community is therefore in a better position to take full advantage of the latest observations, especially the copious amount of spectropolarimetric data just starting to come online from DKIST [12]. If the US wants to stay at the forefront of solar research in the coming decade, studying the coupled dynamics of the solar atmosphere, including the chromosphere, is crucial. This requires sophisticated multi-fluid, multi-species codes with non-equilibrium ionization coupled to global scales. In order to stay competitive, the US must support the development of data-driven codes, not just their use in specific analyses.
• The massive amounts of observational and simulation data, each of which requires large computational resources to produce and analyze, need to be stored at the same location. As an example, during a recent campaign, several hours of observations from DKIST generated roughly 20 Tb of data. Handling this amount of data efficiently, and cross-comparing it to a similar amount of simulation data, is an extraordinarily challenging problem beyond what can or should be required of individual researchers: models need to be tied to the evolving requirements of facilities that are supported at the national level.
• The "joined at the hip" funding of computational modeling and observations was a massive success for the IRIS spacecraft and Bifrost simulation effort. Later, forward modeling through numerous numerical models (Bifrost, MURaM, RADYN [52,53]) was explicitly used in the design phase of the recently selected MUSE mission [54,55]. Radiation MHD simulations were used in the design phase of DKIST instrumentation to simulate instrument performance. Similar capabilities should be specifically supported for new facilities in order to fully realize their individual and collective potential.

Final Statement
The developments listed in §3.3 are critical to enable a comprehensive investigation of physical processes in the coupled solar atmosphere. They are required to make the most of combined observations from multiple observatories and instruments in a consistent dynamical context given by data-driven MHD models. Self-consistently driven models will answer fundamental questions about the solar atmosphere: what is its 3D structure, how did that structure arise, and is it stable? What does its evolution tell us about the Sun's internal dynamo and how it couples outward to the solar wind? Ultimately, these models will allow the science-to-operations transition and enable us to make the jump from flare and CME forecasting based on empirical relations to forecasting derived directly from ensemble modeling around a dynamically constrained initial condition. This gives the physical basis for eruption probability and timing and for the estimation of CME strength, speed, and magnetic field orientation: the vital elements of space weather prediction that will allow humanity's continued exploration of the heliosphere.

Figure 1: Cartoon overview of the data-driving and assimilative modeling framework. The lower level represents the continuous boundary driving data upon which both global (center, left) and detailed local (right) simulations are based. Additional images show recent examples of forward modeling through such simulations. Closing the feedback loops between modeling and observations is required both to interpret the latest observations via inversions and to realize true, in-the-volume data assimilation. Data and modeling adapted from [1, 2, 3]; solar eclipse image in upper left © 2017 Miloslav Druckmüller, Peter Aniol, Shadia Habbal.

Figure 2: Left: Synthetic emission proxy from a data-driven radiative zero-beta MHD simulation of active region (AR) 11158 derived with the Coronal Global Evolutionary Model (CGEM). The run used electric fields inverted from HMI observations as a photospheric driver. Right: Observed emission of AR 11158 in the AIA 131 Å high-temperature channel before the X2.2 flare. Adapted from [23].