
Deterministic predictions of tropical cyclone (TC) intensity from operational forecast systems traditionally have been verified with a summary accuracy measure (e.g., mean absolute error). Since the forecast system development process is coupled to the verification procedure, it follows that TC intensity forecast systems have been developed with the goal of producing predictions that optimize the chosen summary accuracy measure. Here, the consequences of this development process for the quality of the resultant forecasts are diagnosed through a distributions-oriented (DO) verification of operational TC intensity forecasts. DO verification techniques examine the full relationship between a set of forecasts and the corresponding set of observations (i.e., forecast quality), rather than just the accuracy attribute of that relationship.
The DO verification results reveal similar first-order characteristics in the quality of predictions from four TC intensity forecast systems. These characteristics are shown to be consistent with the theoretical response of a forecast system to the imposed goal of summary accuracy measure optimization: production of forecasts that asymptote with lead time to the central tendency of the observed distribution. While such forecasts perform well with respect to the accuracy, unconditional bias, and type I conditional bias attributes of forecast quality, they perform poorly with respect to type II conditional bias. Thus, it is clear that optimization of forecast accuracy is not equivalent to optimization of forecast quality. Ultimately, developers of deterministic forecast systems must take care to employ a verification procedure that promotes good performance with respect to the most desired attributes of forecast quality.
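The asymptotic behavior described above can be illustrated with a small synthetic sketch (the data, the 50-kt climatological mean, and the binning scheme are invented for illustration): a calibrated, accuracy-optimized forecast that regresses toward the climatological mean exhibits little type I conditional bias, E[x|f] − f, but a pronounced type II conditional bias, E[f|x] − x.

```python
import numpy as np

def conditional_biases(f, x, bins):
    """Estimate type I CB, E[x|f] - f, and type II CB, E[f|x] - x, per bin."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    centers = 0.5 * (bins[:-1] + bins[1:])
    type1, type2 = [], []
    for lo, hi, c in zip(bins[:-1], bins[1:], centers):
        in_f = (f >= lo) & (f < hi)      # condition on the forecast value
        in_x = (x >= lo) & (x < hi)      # condition on the observed value
        type1.append(x[in_f].mean() - c if in_f.any() else np.nan)
        type2.append(f[in_x].mean() - c if in_x.any() else np.nan)
    return centers, np.array(type1), np.array(type2)

# Synthetic demo: a calibrated forecast that regresses toward the mean (50 kt)
rng = np.random.default_rng(0)
obs = rng.normal(50, 15, 5000)             # "observed" intensities (kt)
signal = obs + rng.normal(0, 15, 5000)     # imperfect information source
fcst = 50 + 0.5 * (signal - 50)            # MMSE forecast given the signal
centers, t1, t2 = conditional_biases(fcst, obs, np.arange(10.0, 95.0, 10.0))
# t1 is small in the well-populated central bins, while t2 is large and
# positive for weak observed intensities and large and negative for strong
# ones: exactly the type II CB signature discussed above.
```

The demo reproduces the qualitative result of the paper, not its operational numbers: minimizing squared error drives forecasts toward the conditional mean, which is benign for type I CB but poor for type II CB.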


... Verifying the above conditional and marginal distributions is equivalent to verifying the joint distribution. For instance, given two sets of forecasts, f1 and f2, by comparing p(x|f1) and p(x|f2), one can conclude whether one set of forecasts is more reliable than the other; see Moskaitis (2008) and Murphy et al. (1989) for case studies. Whereas linking the forecast distributions to aspects of forecast quality provides forecasters with insights regarding their forecasts, such results are easier to interpret if the different aspects of forecast quality can be quantified using measures. ...

... A derivation of these decompositions is shown in Moskaitis (2008). As annotated above the equations, different terms in the decomposed forms explain different aspects of forecast quality. ...

... When Murphy and Winkler (1987) proposed these decompositions, a binary x was used in their case study, which greatly simplifies the computation. In Moskaitis (2008), the evaluation was performed by discretizing the continuous random variable (tropical cyclone intensity) into bins. Recently, Yang and Perez (2019) used kernel conditional density estimation (KCDE) to estimate the conditional expectations, namely E(x|f) and E(f|x), which removes the dependency on binning. ...
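A simple, binning-free estimate of E(x|f) can be sketched with a Nadaraya-Watson kernel smoother. This is a stand-in for, not a reproduction of, the KCDE approach of Yang and Perez (2019), and the data below are synthetic:

```python
import numpy as np

def kernel_conditional_mean(f, x, grid, bandwidth):
    """Nadaraya-Watson estimate of E[x | f] on a grid of forecast values."""
    f = np.asarray(f, float)[None, :]       # shape (1, n)
    x = np.asarray(x, float)[None, :]
    g = np.asarray(grid, float)[:, None]    # shape (m, 1)
    w = np.exp(-0.5 * ((g - f) / bandwidth) ** 2)  # Gaussian kernel weights
    return (w * x).sum(axis=1) / w.sum(axis=1)

# Demo on data with known conditional mean E[x|f] = 2 f + 1
rng = np.random.default_rng(0)
f_obs = rng.uniform(0.0, 10.0, 4000)
x_obs = 2.0 * f_obs + 1.0 + rng.normal(0.0, 0.5, 4000)
grid = np.linspace(2.0, 8.0, 7)
est = kernel_conditional_mean(f_obs, x_obs, grid, bandwidth=0.3)
```

Unlike binned conditional means, the estimate is available at any forecast value, at the cost of a bandwidth choice.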

The field of energy forecasting has attracted many researchers from different fields (e.g., meteorology, data sciences, mechanical or electrical engineering) over the last decade. Solar forecasting is a fast-growing subdomain of energy forecasting. Despite several previous attempts, the methods and measures used for verification of deterministic (also known as single-valued or point) solar forecasts are still far from being standardized, making forecast analysis and comparison difficult.
To analyze and compare solar forecasts, the well-established Murphy–Winkler framework for distribution-oriented forecast verification is recommended as a standard practice. This framework examines aspects of forecast quality, such as reliability, resolution, association, or discrimination, and analyzes the joint distribution of forecasts and observations, which contains all time-independent information relevant to verification. To verify forecasts, one can use any graphical display or mathematical/statistical measure to provide insights and summarize the aspects of forecast quality. The majority of graphical methods and accuracy measures known to solar forecasters are specific methods under this general framework.
Additionally, measuring the overall skillfulness of forecasters is also of general interest. The use of the root mean square error (RMSE) skill score based on the optimal convex combination of climatology and persistence methods is highly recommended. By standardizing the accuracy measure and reference forecasting method, the RMSE skill score allows—with appropriate caveats—comparison of forecasts made using different models, across different locations and time periods.
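A minimal sketch of the recommended skill score, under the simplifying assumption of a univariate series in which persistence means the previous value and climatology the sample mean (the solar-specific definitions involve clear-sky normalization, which is omitted here):

```python
import numpy as np

def rmse(pred, obs):
    return np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(obs, float)) ** 2))

def optimal_convex_reference(x):
    """Reference forecast: best convex mix of 1-step persistence and climatology."""
    x = np.asarray(x, float)
    pers, target = x[:-1], x[1:]   # persistence forecast and its target
    clim = x.mean()                # climatology forecast
    a, b = pers - clim, target - clim
    alpha = np.clip((a @ b) / (a @ a), 0.0, 1.0)  # least-squares convex weight
    return alpha * pers + (1.0 - alpha) * clim, alpha

def rmse_skill_score(forecast, obs, reference):
    return 1.0 - rmse(forecast, obs) / rmse(reference, obs)

# Demo on a lag-1-correlated series: alpha recovers the autocorrelation
rng = np.random.default_rng(0)
x = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.7 * x[t - 1] + rng.normal()
ref, alpha = optimal_convex_reference(x)
good_fcst = x[1:] + rng.normal(0.0, 0.2, 2999)  # a mock skillful forecast
skill = rmse_skill_score(good_fcst, x[1:], ref)
```

Because the reference adapts its persistence/climatology mix to the series, the resulting skill score is harder to inflate by choosing a weak baseline.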

... Despite the widespread use of MAE and MAE skill in operational and TC-community verification, alternative approaches to TC verification that use the NHC-based verification methodology have appeared in the peer-reviewed literature. For example, some have evaluated the entire distribution of errors when verifying TC forecasts by calculating the median absolute error (MDAE) and confidence intervals, and by analyzing boxplots of the distribution at each lead time (e.g., Powell and Aberson 2001; Moskaitis 2008; Galarneau and Davis 2013; Alaka et al. 2020; Sippel et al. 2021). Even further, the WMO report on verification methods for TCs describes additional metrics that can be used, such as the interquartile range, the root-mean-square error, and correlation coefficients (WMO 2013, their Table 3). ...

This paper introduces a new tool for verifying tropical cyclone (TC) forecasts. Tropical cyclone forecasts made by operational centers and by numerical weather prediction (NWP) models have been objectively verified for decades. Typically, the mean absolute error (MAE) and/or MAE skill are calculated relative to values within the operational center’s best track. Yet, the MAE can be strongly influenced by outliers and yield misleading results. Thus, this paper introduces an assessment of consistency among the MAE skill and two other measures of forecast performance. This “consistency metric” objectively evaluates the forecast-error evolution as a function of lead time based on thresholds applied to 1) the MAE skill; 2) the frequency of superior performance (FSP), which indicates how often one forecast outperforms another; and 3) the median absolute error (MDAE) skill. The utility and applicability of the consistency metric are validated by applying it to four research and forecasting applications. Overall, this consistency metric is a helpful tool to guide analysis and increase confidence in results in a straightforward way. By augmenting the commonly used MAE and MAE skill with this consistency metric and creating a single scorecard with consistency-metric results for TC track, intensity, and significant-wind-radii forecasts, the impact of observing systems, new modeling systems, or model upgrades on TC-forecast performance can be evaluated both holistically and succinctly. This could in turn help forecasters learn from challenging cases and accelerate and optimize developments and upgrades in NWP models.
Significance Statement
Evaluating the impact of observing systems, new modeling systems, or model upgrades on TC forecasts is vital to ensure more rapid and accurate implementations and optimizations. To do so, errors between model forecasts and observed TC parameters are calculated. Historically, analyzing these errors has relied heavily on one or two metrics: the mean absolute error (MAE) and/or MAE skill. Yet, doing so can lead to misleading conclusions if the error distributions are skewed, which often occurs (e.g., for a poorly forecasted TC). This paper presents a new, straightforward way to combine useful information from several different metrics to enable a more holistic assessment of forecast errors than the MAE and MAE skill alone provide.
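The three raw measures behind the consistency metric can be sketched as follows; the thresholds and scorecard conventions of the actual metric are not reproduced, and the demo errors are invented to show the outlier sensitivity of the MAE:

```python
import numpy as np

def consistency_components(err_a, err_b):
    """MAE skill, FSP, and MDAE skill of forecast A relative to baseline B.

    err_a, err_b: paired (homogeneous-sample) forecast errors at one lead time.
    """
    ea = np.abs(np.asarray(err_a, float))
    eb = np.abs(np.asarray(err_b, float))
    mae_skill = 1.0 - ea.mean() / eb.mean()
    fsp = np.mean(ea < eb)             # how often A beats B, case by case
    mdae_skill = 1.0 - np.median(ea) / np.median(eb)
    return mae_skill, fsp, mdae_skill

# Deterministic demo: forecast A is usually better than B but has three busts
err_a = np.full(100, 8.0)
err_a[:3] = 200.0
err_b = np.full(100, 10.0)
mae_s, fsp, mdae_s = consistency_components(err_a, err_b)
# mae_s ≈ -0.376 (A looks worse by MAE alone), fsp = 0.97, mdae_s ≈ 0.2
```

The disagreement among the three numbers is the point: a negative MAE skill alongside a high FSP and positive MDAE skill flags an outlier-dominated comparison rather than a genuinely worse forecast.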

... For case studies of single-storm events, deterministic models are often used for research and model-improvement purposes. Deterministic predictive models require accurate meteorological inputs, such as TC position, pressure, wind speed, and timing, which are achievable when hindcasting a TC after the event (Moskaitis 2008). ...

Tropical cyclones (TCs) are dangerous and destructive natural hazards that impact population, infrastructure, and the environment. TCs are multi-hazard severe weather phenomena; they produce damaging winds, storm surges, and torrential rain that can lead to flooding. Identifying the regions most at risk of TC impacts assists with improving the preparedness and resilience of communities. This study presents the results of TC multi-hazard risk assessment and mapping for Queensland (QLD), Australia. Datasets from the Global Assessment Report (GAR) Atlas were used to evaluate TC hazards. Data on the exposure and vulnerability of population, infrastructure, and the environment were sourced from agencies such as the Australian Bureau of Statistics. The TC hazards of storm surges, floods, and winds were analysed individually. By combining the risk indices for TC hazards, exposure, and vulnerability, an overall TC risk index was derived. TC multi-hazard risk maps were produced at the Local Government Area level using ArcGIS, and regions with a higher risk of being impacted by TCs were identified. The developed TC multi-hazard risk maps provide disaster risk management offices with a comprehensive comparative TC risk profile of QLD that can be used to proactively manage TC risk at the subnational scale.

... All the above concepts can be linked to the standard practice in measure-oriented verification, that is, using summary statistics to describe the goodness of a prediction. For example, the MSE between Y and Y* can be decomposed in the following three ways (Yang and Perez, 2019; Moskaitis, 2008; Murphy and Winkler, 1987): ...
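One standard decomposition of the MSE, into unconditional bias, the two marginal variances, and the forecast-observation association, can be verified numerically (synthetic data; f stands in for Y* and x for Y):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 10000)                   # observations
f = 0.6 * x + 0.3 + rng.normal(0.0, 0.4, 10000)   # forecasts

mse = np.mean((f - x) ** 2)

# bias-variance-association decomposition of the MSE
bias = f.mean() - x.mean()
sf, sx = f.std(), x.std()                  # population (ddof=0) std devs
r = np.corrcoef(f, x)[0, 1]                # linear association
mse_decomp = bias**2 + sf**2 + sx**2 - 2.0 * r * sf * sx

# The identity holds to floating-point precision
assert np.isclose(mse, mse_decomp)
```

The calibration-refinement and likelihood-base-rate decompositions derived in Moskaitis (2008) partition the same MSE along the two conditional factorizations instead; only the form above is checked here.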

This paper aims at merging five gridded products of monthly aerosol optical depth (AOD), namely, MERRA-2, MISR, MODIS-Terra, MODIS-Aqua, and VIIRS, over a period of eight years, from March 2012 to February 2020. Since these individual products offer alternative realizations of the same underlying AOD process, it is beneficial to study them collectively. To that effect, several parametric and nonparametric regressions, including ensemble model output statistics, quantile regression, quantile regression neural network, and quantile regression forest, are used to optimally combine these products. These regressions generate predictive distributions or quantiles, and thus can be thought of as probabilistic merging or fusion tools. As compared to traditional merging or fusion techniques, which only issue single-valued predictions of AOD, the present ones allow the final AOD product to carry a notion of probability, which is essential for uncertainty quantification. To assess the quality of the final AOD product with respect to that of the individual products, this study employs the important, yet often overlooked, distribution-oriented verification approach. In addition, the calibration and sharpness of the predictive distributions issued by the different merging techniques are compared using two strictly proper scoring rules, as well as appropriate graphical tools. Overall, a significant reduction (13%) in root mean square error is achieved by the best merging method (quantile regression forest) compared to the best original dataset (MERRA-2). A significant reduction in bias is also achieved with respect to the MODIS and VIIRS databases, and even more so with respect to MISR.
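As a small aside on how quantile predictions of the kind produced by these regressions are scored, the pinball (quantile) loss is the per-quantile building block; a minimal sketch with synthetic data follows (the paper's own evaluation uses strictly proper scoring rules for full predictive distributions, which are more involved):

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Mean pinball (quantile) loss for predictions of the tau-quantile of y."""
    d = np.asarray(y, float) - np.asarray(q_pred, float)
    return float(np.mean(np.where(d >= 0, tau * d, (tau - 1.0) * d)))

# The loss is minimized by the true quantile: for a standard normal sample,
# predicting the median (0) beats predicting a shifted value (1) at tau = 0.5
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 20000)
loss_median = pinball_loss(y, np.zeros_like(y), 0.5)
loss_shifted = pinball_loss(y, np.ones_like(y), 0.5)
```

Averaging the pinball loss over a dense grid of tau values approximates the CRPS, which is one route to the distribution-level comparison mentioned above.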

... The joint distribution verification method introduced by Moskaitis (2008) was used to reveal the overall TC intensity prediction performances of the three centers (not shown). Similar to that study's findings for NHC and model intensity forecasts, our analysis identified a conditional bias that grows with the forecast lead time: the intensity forecasts of the three centers are generally too low for strong TCs and too high for weak TCs. ...

This study systematically evaluates the accuracy, trends, and error sources of western North Pacific tropical cyclone intensity forecasts between 2005 and 2018. The study uses homogeneous samples of official TC intensity forecasts issued by the China Meteorological Administration (CMA), the Joint Typhoon Warning Center (JTWC), and the Regional Specialized Meteorological Center Tokyo-Typhoon Center (RSMC-Tokyo). The TC intensity forecast accuracy rankings are: 24-48 h, JTWC > RSMC-Tokyo > CMA; 72 h, JTWC > CMA > RSMC-Tokyo; and 96-120 h, JTWC > CMA. Improvements in TC intensity forecasting are marginal but steady for all three centers. The 24-72 h improvement rate is approximately 1-2 % yr⁻¹. The improvement rates are statistically significant at the 95 % level for almost half of the verification times from 0-120 h. The three centers tend to overestimate weak TCs over the northern South China Sea, but strong TCs are sometimes underestimated over the area east of the Philippines. The three centers generally have higher skill scores when forecasting rapid weakening (RW) events than rapid intensification (RI) events. Overall, the three centers are not skillful in forecasting RI events more than three days in advance. Fortunately, RW events could be forecasted five days in advance with an accuracy order of CMA > RSMC-Tokyo > JTWC.

... Murphy and Winkler (1987) pointed out that this approach furnishes a limited description of the complex relationship between forecasts and observations. Therefore, an alternative approach to intensity forecast evaluation, one that enables forecast quality to be analyzed as comprehensively as possible, is needed, perhaps as conducted by Moskaitis (2008). The corresponding results will be presented in future work. ...

The accuracy of eight tropical cyclone (TC) intensity guidance techniques currently used in the East China Regional Meteorological Center during the 2008 and 2009 western North Pacific seasons has been evaluated within two intensity phases: intensification and decay. In the intensification stage, the majority of the techniques showed > 60% probabilities that the errors of forecast 12-h intensity change fall within ±5 m s⁻¹ over the 12- to 60-h forecast intervals, while none was capable of predicting rapid intensification, and most had a bias toward smaller 48-h intensity changes at the beginning of the stage. The majority of the guidance techniques showed > 70% probabilities that the errors of forecast 12-h intensity change fall within ±5 m s⁻¹ through 60 h during the decay phase, and the techniques had little capability of predicting rapid decay events. It is found that the evaluated statistical models had difficulty predicting the strongest cases of decay 36 h after peak intensity, whereas the dynamical and official forecasts were seemingly able to produce some large decay rates.

... Unlike Type-I CB (i.e., E[X|X* = x*] − x*), which relates to the calibration-refinement factorization (Murphy and Winkler, 1987), Type-II CB is not very amenable to statistical bias correction or post-processing (Moskaitis, 2008). ...

A new technique for gauge-only precipitation analysis for improved estimation of heavy-to-extreme precipitation is described and evaluated. The technique is based on a novel extension of classical optimal linear estimation theory in which, in addition to error variance, Type-II conditional bias (CB) is explicitly minimized. When cast in the form of well-known kriging, the methodology yields a new kriging estimator, referred to as CB-penalized kriging (CBPK). CBPK, however, tends to yield negative estimates in areas of no or light precipitation. To address this, an extension of CBPK, referred to herein as extended conditional bias penalized kriging (ECBPK), has been developed which combines the CBPK estimate with a trivial estimate of zero precipitation. To evaluate ECBPK, we carried out real-world and synthetic experiments in which ECBPK and the gauge-only precipitation analysis procedure used in the NWS's Multisensor Precipitation Estimator (MPE) were compared for estimation of point precipitation and mean areal precipitation (MAP), respectively. The results indicate that ECBPK improves hourly gauge-only estimation of heavy-to-extreme precipitation significantly. The improvement is particularly large for estimation of MAP for a range of combinations of basin size and rain gauge network density. This paper describes the technique, summarizes the results and shares ideas for future research.

... [Figure caption: The same as in Fig. 2 but for the wind intensity error.] As noted above, a direct comparison with the operational centers may be less meaningful; nevertheless, some insight into the performance of the present method may be obtained from a comparison with the wind-speed errors of operational centers. Considering that the error of the 48-h wind-intensity forecast as of 2008 is about 7.6 m s⁻¹, averaged from the linear-trend values of 7.5 m s⁻¹ for NHC and 7.7 m s⁻¹ for RSMC Tokyo (see also DeMaria et al., 2007; Moskaitis, 2008), the present method seems to be useful in producing accurate wind intensities. It may be expected that the performance of the present method will be further improved to a certain extent if a statistical correction to the model output is carried out, as is practiced in operational centers (e.g., Elsberry et al., 1999). ...

A case study was made to investigate emission control strategies to reduce photochemical pollution over the Osaka Bay area of Japan. A simplified two-layer box model was developed because of its comparative ease of calculation. The results show that the target area is hydrocarbon (RH)-sensitive with respect to reducing photochemical ozone (O3). In particular, reducing xylene (XYL) is the most effective way to decrease O3 compared with reductions in the other hydrocarbons. For example, a 50% reduction in XYL led to a decrease in the peak O3 concentration of about 25 ppbv from the existing level of 67.1 ppbv. It is also suggested that measures for reducing photochemical pollution in the Osaka Bay area should include the control of RH emissions from painting and printing sources. In addition, the simplified two-layer box model developed in this study shows good predictive capability and can be a useful tool in devising policies for reducing emissions of primary precursors. The model is easy to use and requires little computation time.

... The method adopted for evaluating wind radii forecasts is based on the joint distribution of pairs of forecasts and observations of the radii of 64-, 50-, and 34-kt winds in each of the four quadrants of a hurricane, w(k, s), where k is the quadrant and s is the speed. A joint histogram (Aberson 2008; Moskaitis 2008), which is a plot of occurrence frequency versus two variables, is created from the forecast and observed pairs (w_m, w_o) by binning the radius values for forecasts and observations for a given k and s. Columns of the joint histogram represent the probability distribution function (PDF) of forecast wind radii (w_m) for a given bin of observed wind radius (w_o). ...
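The joint-histogram construction described above can be sketched as follows (the radii are synthetic and the bin edges are illustrative assumptions):

```python
import numpy as np

def joint_histogram(w_m, w_o, bins):
    """Occurrence-frequency joint histogram of forecast/observed wind radii.

    h[i, j] counts pairs with forecast radius in bin i and observed radius in
    bin j; each nonzero column of cond_pdf is the distribution of forecast
    radii given the observed-radius bin (cf. Aberson 2008; Moskaitis 2008).
    """
    h, _, _ = np.histogram2d(w_m, w_o, bins=[bins, bins])
    col_sums = h.sum(axis=0, keepdims=True)
    cond_pdf = np.divide(h, col_sums, out=np.zeros_like(h), where=col_sums > 0)
    return h, cond_pdf

# Synthetic radii (n mi): forecasts scattered around observations
rng = np.random.default_rng(0)
w_o = rng.uniform(0.0, 200.0, 2000)
w_m = np.clip(w_o + rng.normal(0.0, 20.0, 2000), 0.0, None)
bins = np.arange(0.0, 221.0, 20.0)
h, cond_pdf = joint_histogram(w_m, w_o, bins)
```

Scanning the columns of `cond_pdf` from small to large observed radii shows directly whether the forecast distribution shifts with the observations, which is the conditional information a single summary error hides.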

The representation of tropical cyclone track, intensity, and structure in a set of 69 parallel forecasts performed at each of two horizontal grid increments with the Advanced Research Hurricane (AHW) component of the Weather Research and Forecasting (WRF) Model is evaluated. These forecasts covered 10 Atlantic tropical cyclones: 6 from the 2005 season and 4 from 2007. The forecasts were integrated from identical initial conditions produced by a cycling ensemble Kalman filter. The high-resolution forecasts used moving, storm-centered nests of 4- and 1.33-km grid spacing. The coarse-resolution forecasts consisted of a single 12-km domain (which was identical to the outer domain in the forecasts with nests). Forecasts were evaluated out to 120 h. Novel verification techniques were developed to evaluate forecasts of wind radii and the degree of storm asymmetry. Intensity (maximum wind) and rapid intensification, as well as wind radii, were all predicted more accurately with increased horizontal resolution. These results were deemed statistically significant based on the application of bootstrap confidence intervals. No statistically significant differences emerged in storm position errors between the two sets of forecasts. Coarse-resolution forecasts tended to overpredict the extent of winds compared to high-resolution forecasts. The asymmetry of gale-force winds was better predicted in the coarser-resolution simulation, but the asymmetry of hurricane-force winds was predicted better at high resolution. The skill of the wind radii forecasts decayed gradually over 120 h, suggesting a synoptic-scale control on the predictability of outer winds.


A new tropical cyclone (TC) initialization method with a structure-adjustable bogus vortex was applied to forecasts of track, central pressure, and wind intensity for the 417 TCs observed in the western North Pacific during the 3-year period of 2005–2007. In the simulations, the Final Analyses (FNL) at 1° × 1° resolution from the National Centers for Environmental Prediction (NCEP) were used as initial conditions. The present method was shown to produce improved forecasts over those without the TC initialization and those made by the Regional Specialized Meteorological Center Tokyo. The average track (central pressure, wind intensity) errors were as small as 78.0 km (11.4 hPa, 4.9 m s⁻¹) and 139.9 km (12.4 hPa, 5.5 m s⁻¹) for 24-h and 48-h forecasts, respectively. It was found that the forecast errors are almost independent of the size and intensity of the observed TCs because the size and intensity of the bogus vortex can be adjusted to fit the best-track data. The results of this study indicate that a bogus method is useful for predicting track, central pressure, and intensity simultaneously and accurately with a dynamical forecast model.
Keywords: tropical cyclone initialization; balanced bogus vortex; track and intensity prediction; spherical high-order filter

The Hurricane Forecast Improvement Project (HFIP; renamed the “Hurricane Forecast Improvement Program” in 2017) was established by the U.S. National Oceanic and Atmospheric Administration (NOAA) in 2007 with a goal of improving tropical cyclone (TC) track and intensity predictions. A major focus of HFIP has been to increase the quality of guidance products for these parameters that are available to forecasters at the National Weather Service National Hurricane Center (NWS/NHC). One HFIP effort involved the demonstration of an operational decision process, named Stream 1.5, in which promising experimental versions of numerical weather prediction models were selected for TC forecast guidance. The selection occurred every year from 2010 to 2014 in the period preceding the hurricane season (defined as August–October), and was based on an extensive verification exercise of retrospective TC forecasts from candidate experimental models run over previous hurricane seasons. As part of this process, user-responsive verification questions were identified via discussions between NHC staff and forecast verification experts, with additional questions considered each year. A suite of statistically meaningful verification approaches consisting of traditional and innovative methods was developed to respond to these questions. Two examples of the application of the Stream 1.5 evaluations are presented, and the benefits of this approach are discussed. These benefits include the ability to provide information to forecasters and others that is relevant for their decision-making processes, via the selection of models that meet forecast quality standards and are meaningful for demonstration to forecasters in the subsequent hurricane season; clarification of user-responsive strengths and weaknesses of the selected models; and identification of paths to model improvement.
Significance Statement
The Hurricane Forecast Improvement Project (HFIP) tropical cyclone (TC) forecast evaluation effort led to innovations in TC predictions as well as new capabilities to provide more meaningful and comprehensive information about model performance to forecast users. Such an effort—to clearly specify the needs of forecasters and clarify how forecast improvements should be measured in a “user-oriented” framework—is rare. This project provides a template for one approach to achieving that goal.

Solar irradiance forecasting is one of the most efficient methods to handle the potential problems caused by large and frequent photovoltaic fluctuations. In satellite-based forecasting methods, atmospheric attenuation receives less attention than other components (notably cloud effects). This study explores the possibility of improving irradiance forecasting by using an advanced clear-sky model (i.e., the McClear model) and a running-window-based affine transformation with local measurements. The McClear model notably aims at accounting for intraday variability in aerosols and water vapor, in contrast with the European Solar Radiation Atlas (ESRA) model, which is based on climatological monthly means of Linke turbidity. The affine transformation, fitted over a running window of a few days in the recent past, can serve as a correction procedure and has the potential to lessen the impact of inaccurate atmospheric estimation. Irradiance forecasting is carried out at lead times from 15 min to 3 h at an interval of 15 min, based on China's second-generation geostationary satellite Fengyun-4A. Measure-oriented and distribution-oriented approaches are used for comprehensive verification. The results show that without the affine transformation, the forecasting model with the McClear model outperforms that with the ESRA model, owing to better estimation of atmospheric attenuation. On the other hand, the affine transformation significantly improves the forecasting models. Overestimations still exist but are significantly reduced, to the range of 2%–5.5%. After the affine transformation, the forecasting models achieve very close performances no matter which clear-sky model is implemented, except that forecasts with the McClear model are much better calibrated at a high irradiance level (i.e., 900 W/m²).
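The running-window affine transformation can be sketched in its simplest form: an ordinary least-squares fit of observations on forecasts over the recent window, applied to the new forecast. The window contents and values below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def affine_corrected_forecast(fcst_hist, obs_hist, fcst_now):
    """Correct a new forecast with an affine map fitted on a sliding window.

    The slope a and intercept b minimize sum((obs - (a * fcst + b))**2)
    over the paired history from the recent past.
    """
    a, b = np.polyfit(np.asarray(fcst_hist, float),
                      np.asarray(obs_hist, float), 1)
    return a * fcst_now + b

# Demo: a systematic 10% overestimation plus offset is removed exactly
fcst_hist = np.linspace(100.0, 800.0, 50)      # past forecasts (W/m^2)
obs_hist = 0.9 * fcst_hist + 30.0              # matching observations
corrected = affine_corrected_forecast(fcst_hist, obs_hist, 500.0)  # → 480.0
```

In practice the window would slide forward each day so the fitted slope and intercept track slowly varying atmospheric conditions.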


Gridded solar radiation products, namely satellite-derived irradiance and reanalysis irradiance, are key to next-generation solar resource assessment and forecasting. Since their accuracies are generally lower than that of ground-based measurements, validating gridded solar radiation products is necessary in order to understand their quality and characteristics. This article delivers a worldwide validation of hourly global horizontal irradiance derived from satellite imagery and reanalysis. The accuracies of 6 of the latest satellite-derived irradiance products (CAMS-RAD, NSRDB, SARAH-2, SARAH-E, CERES-SYN1deg, and Solcast) and 2 of the latest global reanalysis irradiance products (ERA5 and MERRA-2) are verified against the complete records from 57 BSRN stations over 27 years (1992–2018). This scope of validation is unprecedented in the field of solar energy. Moreover, the importance of using distribution-oriented verification approaches is emphasized. Such approaches go beyond the traditional measure-oriented verification approach and thus can offer additional insight and flexibility to the verification problem.

A diagnostic approach to forecast verification is described and illustrated. This approach is based on a general framework for forecast verification. It is “diagnostic” in the sense that it focuses on the fundamental characteristics of the forecasts, the corresponding observations, and their relationship. Three classes of diagnostic verification methods are identified: 1) the joint distribution of forecasts and observations and conditional and marginal distributions associated with factorizations of this joint distribution; 2) summary measures of these joint, conditional, and marginal distributions; and 3) performance measures and their decompositions. Linear regression models that can be used to describe the relationship between forecasts and observations are also presented. Graphical displays are advanced as a means of enhancing the utility of this body of diagnostic verification methodology. A sample of National Weather Service maximum temperature forecasts (and observations) for Minneapolis, ...

This paper provides a detailed description of the relationship between spring snow mass in the mountain areas of the western United States and summertime precipitation in the southwestern United States associated with the North American monsoon system and examines the hypothesis that antecedent spring snow mass can modulate monsoon rains through effects on land surface energy balance. Analysis of spring snow water equivalent (SWE) and July-August (JA) precipitation for the period of 1948-97 confirms the inverse snow-monsoon relationship noted in previous studies. Examination of regional difference in SWE-JA precipitation associations shows that although JA precipitation in New Mexico is significantly correlated with SWE over much larger areas than in Arizona, the overall strength of the correlations are just as strong in Arizona as in New Mexico. Results from this study also illustrate that the snow-monsoon relationship is unstable over time. In New Mexico, the relationship is strongest during 1965-92 and is weaker outside that period. By contrast, Arizona shows strongest snow-monsoon associations before 1970. The temporal coincidence between stronger snow-monsoon associations over Arizona and weaker snow-monsoon associations over New Mexico (and vice versa) suggests a common forcing mechanism and that the variations in the strength of snow-monsoon associations are more than just climate noise. There is a need to understand how other factors modulate monsoonal rainfall before realistic predictions of summertime precipitation in the Southwest can be made.

A method for routinely verifying numerical weather prediction surface marine winds with satellite scatterometer winds is introduced. The marine surface winds from the Australian Bureau of Meteorology's operational global and regional numerical weather prediction systems are evaluated. The model marine surface layer is described. Marine surface winds from the global and limited-area models are compared with observations, both in situ (anemometer) and remote (scatterometer). A 2-yr verification shows that wind speeds from the regional model are typically underestimated by approximately 5%, with a greater bias in the meridional direction than the zonal direction. The global model also underestimates the surface winds by around 5%-10%. A case study of a significant marine storm shows that where larger errors occur, they are due to an underestimation of the storm intensity, rather than to biases in the boundary layer parameterizations.

Using currently available operational forecast datasets on the tracks and intensities of tropical cyclones over the Pacific Ocean for the years 1998, 1999, and 2000, a multimodel superensemble has been constructed following the earlier work of the authors on the Atlantic hurricanes. The member models include the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Centers for Environmental Prediction/Environmental Modeling Center [NCEP/EMC, the Aviation (AVN) and Medium-Range Forecast (MRF) Models], the U.S. Navy [Naval Operational Global Atmospheric Prediction System (NOGAPS)], the U.K. Met Office (UKMO), and the Japan Meteorological Agency (JMA). The superensemble methodology includes a collective bias estimation from a training phase in which a multiple-regression-based least squares minimization principle for the model forecasts with respect to the observed measures is employed. This is quite different from a simple bias correction, whereby a mean value is simply shifted. These bias estimates are described by separate weights at every 12 h during the forecasts for each of the member models. Superensemble forecasts for track and intensity are then constructed up to 144 h into the future using these weights. Some 100 past forecasts of tropical cyclone days are used to define the training phase for each basin. The findings presented herein show a marked improvement for the tracks and intensities of forecasts from the proposed multimodel superensemble as compared to the forecasts from member models and the ensemble mean. This note includes detailed statistics on the Pacific Ocean tropical cyclone forecasts for the years 1998, 1999, and 2000.
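The training-phase regression at the core of the superensemble can be sketched as follows. The member-model intensities are synthetic stand-ins, and a single set of weights is fit here for simplicity (operationally, separate weights are estimated at every 12 h of lead time).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training phase: observed intensities (kt) and three member-model
# forecasts, each with its own bias and noise level
obs = 60 + 25 * rng.random(100)
members = np.column_stack([
    obs + 8 + 4 * rng.standard_normal(100),   # model A: high bias
    obs - 5 + 6 * rng.standard_normal(100),   # model B: low bias
    obs + 3 * rng.standard_normal(100),       # model C: nearly unbiased
])

# Superensemble training: least-squares weights (plus an intercept) mapping
# member forecasts to observations -- a collective bias estimation, not a
# simple mean shift
X = np.column_stack([np.ones(len(obs)), members])
weights, *_ = np.linalg.lstsq(X, obs, rcond=None)

# Apply the trained weights to a new set of member forecasts
new_members = np.array([[78.0, 62.0, 70.0]])
superensemble = np.column_stack([np.ones(1), new_members]) @ weights
```

Because the equally weighted ensemble mean lies inside the space the regression searches, the fitted superensemble can never do worse than the ensemble mean on the training sample.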

The inability of a general circulation model (GCM) to predict the surface weather parameters accurately necessitates statistical interpretation of numerical weather prediction (NWP) model output. Here a system for forecasting maximum and minimum temperatures has been developed and implemented for 12 locations in India based on the perfect prog method (PPM) approach. The analyzed data from the ECMWF for a period of 6 yr (1985-90) are used to develop PPM model equations. Daily forecasts for maximum and minimum temperatures are then obtained from these equations by using T-80 model output. In order to assess the skill and quality of the temperature forecasts, an attempt has been made to verify them by employing the conditional and marginal distribution of forecasts and observations using the data of four monsoon seasons from 1997 through 2000.

Experimental gridded forecasts of surface temperature issued by National Weather Service offices in the western United States during the 2003/04 winter season (18 November 2003-29 February 2004) are evaluated relative to surface observations and gridded analyses. The 5-km horizontal resolution gridded forecasts issued at 0000 UTC for forecast lead times at 12-h intervals from 12 to 168 h were obtained from the National Digital Forecast Database (NDFD). Forecast accuracy and skill are determined relative to observations at over 3000 locations archived by MesoWest. Forecast quality is also determined relative to Rapid Update Cycle (RUC) analyses at 20-km resolution that are interpolated to the 5-km NDFD grid as well as objective analyses obtained from the Advanced Regional Prediction System Data Assimilation System that rely upon the MesoWest observations and RUC analyses. For the West as a whole, the experimental temperature forecasts issued at 0000 UTC during the 2003/04 winter season exhibit skill at lead times of 12, 24, 36, and 48 h on the basis of several verification approaches. Subgrid-scale temperature variations and observational and analysis errors undoubtedly contribute some uncertainty regarding these results. Even though the "true" values appropriate to evaluate the forecast values on the NDFD grid are unknown, it is estimated that the root-mean-square errors of the NDFD temperature forecasts are on the order of 3°C at lead times shorter than 48 h and greater than 4°C at lead times longer than 120 h. However, such estimates are derived from only a small fraction of the NDFD grid boxes. Incremental improvements in forecast accuracy as a result of forecaster adjustments to the 0000 UTC temperature grids from 144- to 24-h lead times are estimated to be on the order of 13%.

This paper presents an extension of the operational consensus forecast (OCF) method, which performs a statistical correction of model output at sites followed by weighted average consensus on a daily basis. Numerical weather prediction (NWP) model forecasts are received from international centers at various temporal resolutions. As such, in order to extend the OCF methodology to hourly temporal resolution, a method is described that blends multiple models regardless of their temporal resolution. The hourly OCF approach is used to generate forecasts of 2-m air temperature, dewpoint temperature, RH, mean sea level pressure derived from the barometric pressure at the station location (QNH), along with 10-m wind speed and direction for 283 Australian sites. In comparison to a finescale hourly regional model, the hourly OCF process results in reductions in average mean square error of 47% (air temperature), 40% (dewpoint temperature), 43% (RH), 29% (QNH), 42% (wind speed), and 11% (wind direction) during February-March, with slightly higher reductions in May. As part of the development of the approach, the systematic and random natures of hourly NWP forecast errors are evaluated and found to vary with forecast hour, with a diurnal modulation overlaying the normal error growth with time. The site-based statistical correction of the model forecasts is found to include simple statistical downscaling. As such, the method is found to be most appropriate for meteorological variables that vary systematically with spatial resolution.
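A minimal sketch of the correct-then-combine idea behind OCF, assuming a simple recent-mean bias correction at the site and inverse-mean-square-error weights; the operational scheme's exact correction and weighting may differ, and the numbers are invented.

```python
import numpy as np

# Toy recent history at one site: observations and two model forecasts (deg C)
obs_hist = np.array([21.0, 23.5, 19.0, 24.0, 22.0])
m1_hist = np.array([23.0, 25.0, 21.5, 26.0, 24.0])   # warm-biased model
m2_hist = np.array([20.0, 22.0, 18.5, 23.5, 21.0])   # slightly cool model

def weight(hist, obs):
    """Inverse mean-square error of the bias-corrected model over the window."""
    err = (hist - (hist - obs).mean()) - obs
    return 1.0 / np.mean(err ** 2)

w1, w2 = weight(m1_hist, obs_hist), weight(m2_hist, obs_hist)

# Consensus for a new forecast: remove each model's recent mean bias,
# then take the weighted average
f1_new, f2_new = 25.0, 22.5
c1 = f1_new - (m1_hist - obs_hist).mean()
c2 = f2_new - (m2_hist - obs_hist).mean()
consensus = (w1 * c1 + w2 * c2) / (w1 + w2)
```

With these numbers the warm model's +2.0 bias and the cool model's -0.9 bias are removed first, so the consensus lands between the two corrected values, weighted toward the historically more accurate model.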

This paper introduces the use of information theory in characterizing climate predictability. Specifically, the concepts of entropy and transinformation are employed. Entropy measures the amount of uncertainty in our knowledge of the state of the climate system. Transinformation represents the information gained about an anomaly at any time t with knowledge of the size of the initial anomaly. It has many desirable properties that can be used as a measure of the predictability of the climate system. These concepts when applied to climate predictability are illustrated through a simple stochastic climate model (an energy balance model forced by noise). The transinformation is found to depict the degradation of information about an anomaly despite the fact that we have perfect knowledge of the initial state. Its usefulness, especially when generalized to other climate models, is discussed.
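The two information-theoretic quantities can be illustrated on a toy discretized climate state; the joint distribution below is invented for illustration, not taken from the paper's stochastic model.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy joint distribution of the initial anomaly state s0 and a later state st,
# each discretized into two states
joint = np.array([[0.35, 0.15],
                  [0.10, 0.40]])
p0 = joint.sum(axis=1)   # marginal of the initial state
pt = joint.sum(axis=0)   # marginal of the later state

# Transinformation (mutual information): information gained about the later
# state from knowledge of the initial anomaly
transinfo = entropy(p0) + entropy(pt) - entropy(joint.ravel())

# It vanishes when the two states are independent, i.e. when all
# predictability from the initial anomaly has been lost
indep = np.outer(p0, pt)
no_info = entropy(p0) + entropy(pt) - entropy(indep.ravel())
```

As the model's noise forcing decorrelates the anomaly from its initial state, the joint distribution approaches the independent product and the transinformation decays toward zero, which is the degradation of information the abstract describes.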

The question of who is the "best" forecaster in a particular media market is one that the public frequently asks. The authors have collected approximately one year's forecasts from the National Weather Service and major media presentations for Oklahoma City. Diagnostic verification procedures indicate that the question of best does not have a clear answer. All of the forecast sources have strengths and weaknesses, and it is possible that a user could take information from a variety of sources to come up with a forecast that has more value than any one individual source provides. The analysis provides numerous examples of the utility of a distributions-oriented approach to verification while also providing insight into the problems the public faces in evaluating the array of forecasts presented to them.

Comparative verification of operational 6-h quantitative precipitation forecast (QPF) products used for streamflow models run at National Weather Service (NWS) River Forecast Centers (RFCs) is presented. The QPF products include 1) national guidance produced by operational numerical weather prediction (NWP) models run at the National Centers for Environmental Prediction (NCEP), 2) guidance produced by forecasters at the Hydrometeorological Prediction Center (HPC) of NCEP for the conterminous United States, 3) local forecasts produced by forecasters at NWS Weather Forecast Offices (WFOs), and 4) the final QPF product for multi-WFO areas prepared by forecasters at RFCs. A major component of the study was development of a simple scoring methodology to indicate the relative accuracy of the various QPF products for NWS managers and possibly hydrologic users. The method is based on mean absolute error (MAE) and bias scores for continuous precipitation amounts grouped into mutually exclusive intervals. The grouping (stratification) was conducted on the basis of observed precipitation, which is customary, and also forecast precipitation. For ranking overall accuracy of each QPF product, the MAE for the two stratifications was objectively combined. The combined MAE could be particularly useful when the accuracy rankings for the individual stratifications are not consistent. MAE and bias scores from the comparative verification of 6-h QPF products during the 1998/99 cool season in the eastern United States for day 1 (0-24-h period) indicated that the HPC guidance performed slightly better than corresponding products issued by WFOs and RFCs. Nevertheless, the HPC product was only marginally better than the best-performing NCEP NWP model for QPF in the eastern United States, the Aviation (AVN) Model.
In the western United States during the 1999/2000 cool season, the WFOs improved on the HPC guidance for day 1 but not for day 2 or day 3 (24-48- and 48-72-h periods, respectively). Also, both of these human QPF products improved on the AVN Model on day 1, but by day 3 neither did. These findings contributed to changes in the NWS QPF process for hydrologic model input.
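The stratified scoring methodology can be sketched as follows. The precipitation amounts and interval boundaries are illustrative, and the equally weighted average over both stratifications is one simple way to form the combined MAE.

```python
import numpy as np

# Toy 6-h QPF sample: forecast and observed precipitation amounts (mm)
fcst = np.array([0.0, 2.0, 5.0, 12.0, 1.0, 8.0, 20.0, 0.5])
obs = np.array([0.0, 4.0, 3.0, 15.0, 0.0, 10.0, 14.0, 2.0])

bins = [(0.0, 2.5), (2.5, 10.0), (10.0, np.inf)]   # mutually exclusive intervals

def stratified_scores(strat_by):
    """MAE and bias within intervals defined on either obs or forecast amount."""
    out = []
    for lo, hi in bins:
        sel = (strat_by >= lo) & (strat_by < hi)
        if sel.any():
            out.append((np.mean(np.abs(fcst[sel] - obs[sel])),
                        np.mean(fcst[sel] - obs[sel])))
    return out

by_obs = stratified_scores(obs)    # customary stratification on observed amount
by_fcst = stratified_scores(fcst)  # the additional forecast-based stratification

# Combined MAE over both stratifications, for ranking competing QPF products
combined_mae = np.mean([mae for mae, _ in by_obs] + [mae for mae, _ in by_fcst])
```

Stratifying by forecast amount as well as observed amount exposes conditional biases that a single overall MAE would hide, which is exactly why the combined score is useful when the two rankings disagree.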

The current version of the Statistical Typhoon Intensity Prediction Scheme (STIPS) used operationally at the Joint Typhoon Warning Center (JTWC) to provide 12-hourly tropical cyclone intensity guidance through day 5 is documented. STIPS is a multiple linear regression model. It was developed using a "perfect prog" assumption and has a statistical-dynamical framework, which utilizes environmental information obtained from Navy Operational Global Analysis and Prediction System (NOGAPS) analyses and the JTWC historical best track for development. NOGAPS forecast fields are used in real time. A separate version of the model (decay-STIPS) is produced that accounts for the effects of landfall by using an empirical inland decay model. Despite their simplicity, STIPS and decay-STIPS produce skillful intensity forecasts through 4 days, based on a 48-storm verification (July 2003-October 2004). Details of this model's development and operational performance are presented.

Tropical cyclone track forecasting has improved recently to the point at which extending the official forecasts of both track and intensity to 5 days is being considered at the National Hurricane Center and the Joint Typhoon Warning Center. Current verification procedures at both of these operational centers utilize a suite of control models, derived from the "climatology" and "persistence" techniques, that make forecasts out to 3 days. To evaluate and verify 5-day forecasts, the current suite of control forecasts needs to be redeveloped to extend the forecasts from 72 to 120 h. This paper describes the development of 5-day tropical cyclone intensity forecast models derived from climatology and persistence for the Atlantic, the eastern North Pacific, and the western North Pacific Oceans. Results using independent input data show that these new models possess similar error and bias characteristics when compared with their predecessors in the North Atlantic and eastern North Pacific but that the west Pacific model shows a statistically significant improvement when compared with its forerunner. Errors associated with these tropical cyclone intensity forecast models are also shown to level off beyond 3 days in all of the basins studied.

Nested limited-area models (LAMs) have been used by the scientific community for a long time, with the implicit assumption that they are able to generate meaningful small-scale features that were absent in the lateral boundary conditions and sometimes even in the initial conditions. This hypothesis has never been seriously challenged in spite of reservations expressed by part of the scientific community. In order to study this hypothesis, a perfect-model approach is followed. A high-resolution LAM driven by global analyses is used over a large domain to generate a "reference run." These fields are filtered afterward to remove small scales in order to mimic low-resolution nesting data. The same high-resolution LAM, but over a small domain, is nested with these filtered fields and run for several days. The ability of the LAM to regenerate the small scales that were absent in the initial and lateral boundary conditions is estimated by comparing both runs over the same region. The simulations are analyzed for several variables using a distribution-oriented approach, which provides an estimation of the forecasting ability as a function of the value of the variable. It is found that variables with steep spectra, such as geopotential and temperature, display good forecasting skills for the entire range of values but improve little the forecast skill of a low-resolution perfect model. For noisier variables with flatter spectra, such as vorticity and precipitation, the high-resolution forecast provides a more realistic and extended range of forecast values for the variables, but rather low skill for extreme events. The probability of a successful forecast for these extreme cases, however, is much higher than that of a random model. When errors in the phase in the weather systems are not penalized, forecasting skill increases considerably.
This suggests that, despite the inability to perform as pointwise deterministic forecasts, useful information may be generated by LAMs if considered in a probabilistic way.

A method is developed to adjust the Kaplan and DeMaria tropical cyclone inland wind decay model for storms that move over narrow landmasses. The basic assumption that the wind speed decay rate after landfall is proportional to the wind speed is modified to include a factor equal to the fraction of the storm circulation that is over land. The storm circulation is defined as a circular area with a fixed radius. Appli-cation of the modified model to Atlantic Ocean cases from 1967 to 2003 showed that a circulation radius of 110 km minimizes the bias in the total sample of landfalling cases and reduces the mean absolute error of the predicted maximum winds by about 12%. This radius is about 2 times the radius of maximum wind of a typical Atlantic tropical cyclone. The modified decay model was applied to the Statistical Hurricane Intensity Prediction Scheme (SHIPS), which uses the Kaplan and DeMaria decay model to adjust the intensity for the portion of the predicted track that is over land. The modified decay model reduced the intensity forecast errors by up to 8% relative to the original decay model for cases from 2001 to 2004 in which the storm was within 500 km from land.
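A sketch of the modified decay assumption described above, with placeholder constants in the spirit of the Kaplan-DeMaria model rather than the fitted operational values.

```python
import numpy as np

def decayed_wind(v0, land_fraction, vb=26.7, alpha=0.095):
    """Exponential inland decay, with the hourly decay rate scaled by the
    fraction of the storm circulation (a circle of fixed radius) over land.

    vb (background wind, kt) and alpha (per hour) are illustrative
    placeholders, not the operationally fitted constants.
    """
    v = float(v0)
    for f in land_fraction:          # land fraction for each hour after landfall
        v = vb + (v - vb) * np.exp(-alpha * f)
    return v

v0 = 100.0                           # intensity at landfall (kt)
hours = 12
full_land = decayed_wind(v0, np.ones(hours))        # circulation entirely inland
narrow = decayed_wind(v0, 0.4 * np.ones(hours))     # narrow landmass, 40% over land
```

Scaling the decay rate by the over-land fraction is what keeps a storm crossing a narrow landmass stronger than one fully inland, removing the over-decay bias the original model showed in such cases.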

The distributions-oriented approach to forecast verification uses an estimate of the joint distribution of forecasts and observations to evaluate forecast quality. However, small verification data samples can produce unreliable estimates of forecast quality due to sampling variability and biases. In this paper, new techniques for verification of probability forecasts of dichotomous events are presented. For forecasts of this type, simplified expressions for forecast quality measures can be derived from the joint distribution. Although traditional approaches assume that forecasts are discrete variables, the simplified expressions apply to either discrete or continuous forecasts. With the derived expressions, most of the forecast quality measures can be estimated analytically using sample moments of forecasts and observations from the verification data sample. Other measures require a statistical modeling approach for estimation. Results from Monte Carlo experiments for two forecasting examples show that the statistical modeling approach can significantly improve estimates of these measures in many situations. The improvement is achieved mostly by reducing the bias of forecast quality estimates and, for very small sample sizes, by slightly reducing the sampling variability. The statistical modeling techniques are most useful when the verification data sample is small (a few hundred forecast–observation pairs or less), and for verification of rare events, where the sampling variability of forecast quality measures is inherently large.
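The moment-based estimation idea can be illustrated for two simple measures; the forecast-observation pairs below are invented, and the identity shown holds for discrete or continuous probability forecasts alike.

```python
import numpy as np

# Toy verification sample: continuous probability forecasts of a binary event
f = np.array([0.1, 0.7, 0.3, 0.9, 0.2, 0.6, 0.8, 0.05])
x = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Several quality measures follow directly from sample moments
mf, mx = f.mean(), x.mean()
me = mf - mx                        # unconditional (mean) bias
brier = np.mean((f - x) ** 2)       # accuracy (Brier score)

# The same Brier score rebuilt from moments: variances, covariance, and bias
moment_brier = f.var() + x.var() - 2 * np.cov(f, x, bias=True)[0, 1] + me ** 2
```

Because such measures reduce to moments, they can be estimated analytically from small samples without binning the forecasts, which is where the moment and statistical-modeling approaches gain over a direct estimate of the full joint distribution.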

The authors have carried out verification of 590 12–24-h high-temperature forecasts from numerical guidance products and human forecasters for Oklahoma City, Oklahoma, using both a measures-oriented verification scheme and a distributions-oriented scheme. The latter captures the richness associated with the relationship of forecasts and observations, providing insight into strengths and weaknesses of the forecasting systems, and showing areas in which improvement in accuracy can be obtained. The analysis of this single forecast element at one lead time shows the amount of information available from a distributions-oriented verification scheme. In order to obtain a complete picture of the overall state of forecasting, it would be necessary to verify all elements at all lead times. The authors urge the development of such a national verification scheme as soon as possible, since without it, it will be impossible to monitor changes in the quality of forecasts and forecasting systems in the future.

The Geophysical Fluid Dynamics Laboratory (GFDL) Hurricane Prediction System was adopted by the U.S. National Weather Service as an operational hurricane prediction model in the 1995 hurricane season. The framework of the prediction model is described with emphasis on its unique features. The model uses a multiply nested movable mesh system to depict the interior structure of tropical cyclones. For cumulus parameterization, a soft moist convective adjustment scheme is used. The model initial condition is defined through a method of vortex replacement. It involves generation of a realistic hurricane vortex by a scheme of controlled spinup. Time integration of the model is carried out by a two-step iterative method that has a characteristic of frequency-selective damping. The outline of the prediction system is presented and the system performance in the 1995 hurricane season is briefly summarized. Both in the Atlantic and the eastern Pacific, the average track forecast errors are substantially reduced by the GFDL model, compared with forecasts by other models, particularly for the forecast periods beyond 36 h. Forecasts of Hurricane Luis and Hurricane Marilyn were especially skillful. A forecast bias is noticed in cases of Hurricane Opal and other storms in the Gulf of Mexico. The importance of accurate initial conditions, in both the environmental flow and the storm structure, is argued.

Modifications to the Atlantic and east Pacific versions of the operational Statistical Hurricane Intensity Prediction Scheme (SHIPS) for each year from 1997 to 2003 are described. Major changes include the addition of a method to account for the storm decay over land in 2000, the extension of the forecasts from 3 to 5 days in 2001, and the use of an operational global model for the evaluation of the atmospheric predictors instead of a simple dry-adiabatic model beginning in 2001.
A verification of the SHIPS operational intensity forecasts is presented. Results show that the 1997–2003 SHIPS forecasts had statistically significant skill (relative to climatology and persistence) out to 72 h in the Atlantic, and at 48 and 72 h in the east Pacific. The inclusion of the land effects reduced the intensity errors by up to 15% in the Atlantic, and up to 3% in the east Pacific, primarily for the shorter-range forecasts. The inclusion of land effects did not significantly degrade the forecasts at any time period. Results also showed that the 4–5-day forecasts that began in 2001 did not have skill in the Atlantic, but had some skill in the east Pacific.
An experimental version of SHIPS that included satellite observations was tested during the 2002 and 2003 seasons. New predictors included brightness temperature information from Geostationary Operational Environmental Satellite (GOES) channel 4 (10.7 μm) imagery, and oceanic heat content (OHC) estimates inferred from satellite altimetry observations. The OHC estimates were only available for the Atlantic basin. The GOES data significantly improved the east Pacific forecasts by up to 7% at 12–72 h. The combination of GOES and satellite altimetry improved the Atlantic forecasts by up to 3.5% through 72 h for those storms west of 50°W.

The four primary predictors are 1) the difference between the current storm intensity and an estimate of the maximum possible intensity determined from the sea surface temperature, 2) the vertical shear of the horizontal wind, 3) persistence, and 4) the flux convergence of eddy angular momentum evaluated at 200 mb. The sea surface temperature and vertical shear variables are averaged along the track of the storm during the forecast period. The sea surface temperatures along the storm track are determined from monthly climatological analyses linearly interpolated to the position and date of the storm. The vertical shear values along the track of the storm are estimated using the synoptic analysis at the beginning of the forecast period. All other predictors are evaluated at the beginning of the forecast period. -from Authors

The past decade has been marked by significant advancements in numerical weather prediction of hurricanes, which have greatly contributed to the steady decline in forecast track error. Since its operational implementation by the U.S. National Weather Service (NWS) in 1995, the best-track model performer has been NOAA's regional hurricane model developed at the Geophysical Fluid Dynamics Laboratory (GFDL). The purpose of this paper is to summarize the major upgrades to the GFDL hurricane forecast system since 1998. These include coupling the atmospheric component with the Princeton Ocean Model, which became operational in 2001, major physics upgrades implemented in 2003 and 2006, and increases in both the vertical resolution in 2003 and the horizontal resolution in 2002 and 2005. The paper will also report on the GFDL model performance for both track and intensity, focusing particularly on the 2003 through 2006 hurricane seasons. During this period, the GFDL track errors were the lowest of all the dynamical model guidance available to the NWS Tropical Prediction Center in both the Atlantic and eastern Pacific basins. It will also be shown that the GFDL model has exhibited a steady reduction in its intensity errors during the past 5 yr, and can now provide skillful intensity forecasts. Tests of 153 forecasts from the 2004 and 2005 Atlantic hurricane seasons and 75 forecasts from the 2005 eastern Pacific season have demonstrated a positive impact on both track and intensity prediction in the 2006 GFDL model upgrade, through introduction of a cloud microphysics package and an improved air-sea momentum flux parameterization. In addition, the large positive intensity bias in sheared environments observed in previous versions of the model is significantly reduced. This led to the significant improvement in the model's reliability and skill for forecasting intensity that occurred in 2006.

The 2010 Atlantic hurricane season had above normal activity, with 404 official forecasts issued. The NHC official track forecast errors in the Atlantic basin were similar to the previous 5-yr means from 12-36 h, but up to 26% smaller beyond 36 h, and set a record for accuracy at 120 h. On average, the skill of the official forecasts was very close to that of the TCON/TVCN consensus models, as well as to the best performing of the dynamical models. The EMXI and GFSI exhibited the highest skill, and the EGRI performed well at longer forecast times. The NGPI and GFNI were the poorer performing major dynamical models. Among the consensus models, FSSE (a corrected consensus model) performed the best overall for the second year in a row. The corrected versions of TCON, TVCN, and GUNA, however, did not perform as well as their parent models. The Government Performance and Results Act of 1993 (GPRA) track goal was met. Official intensity errors for the Atlantic basin in 2010 were above the 5-yr means at 12-48 h, but below the 5-yr means at the remaining lead times. Decay-SHIFOR errors in 2010 were above their 5-yr means at all forecast times, indicating the season's storms were more difficult to forecast than normal. Regarding the individual intensity guidance models, the consensus models ICON/IVCN were among the best performers at 12-48 h, with the LGEM showing similar or superior skill at 72-120 h. The FSSE skill was very near that of the ICON/IVCN at 12-72 h, but decreased sharply beyond that. The dynamical models were the worst performers, but did show competitive skill at longer forecast times. The GPRA intensity goal was not met.

The traditional approach to forecast verification consists of computing one, or at most very few, quantities from a set of forecasts and verifying observations. However, this approach necessarily discards a large portion of the information regarding forecast quality that is contained in a set of forecasts and observations. Theoretically sound alternative verification approaches exist, but these often involve computation and examination of many quantities in order to obtain a complete description of forecast quality and, thus, pose difficulties in interpretation. This paper proposes and illustrates an intermediate approach to forecast verification, in which the multifaceted nature of forecast quality is recognized but the description of forecast quality is encapsulated in a much smaller number of parameters. These parameters are derived from statistical models fit to verification datasets. Forecasting performance as characterized by the statistical models can then be assessed in a relatively complete manner. In addition, the fitted statistical models provide a mechanism for smoothing sampling variations in particular finite samples of forecasts and observations. This approach to forecast verification is illustrated by evaluating and comparing selected samples of probability of precipitation (PoP) forecasts and the matching binary observations. A linear regression model is fit to the conditional distributions of the observations given the forecasts and a beta distribution is fit to the frequencies of use of the allowable probabilities. Taken together, these two models describe the joint distribution of forecasts and observations, and reduce a 21-dimensional verification problem to 4 dimensions (two parameters each for the regression and beta models). 
Performance of the selected PoP forecasts is evaluated and compared across forecast type, location, and lead time in terms of these four parameters (and simple functions of the parameters), and selected graphical displays are explored as a means of obtaining relatively transparent views of forecasting performance within this approach to verification.
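The four-parameter description can be sketched with method-of-moments fits; the PoP sample below is invented, and the simple moment estimators stand in for whatever fitting procedure the authors used.

```python
import numpy as np

# Toy PoP verification sample: forecast probabilities and binary observations
f = np.array([0.0, 0.1, 0.2, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0])
x = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Model 1: linear regression for the conditional mean of x given f
# (two parameters: intercept b0 and slope b1)
b1 = np.cov(f, x, bias=True)[0, 1] / f.var()
b0 = x.mean() - b1 * f.mean()

# Model 2: beta distribution fit to the frequency of use of the forecast
# probabilities, via method of moments (two parameters: a and b)
m, v = f.mean(), f.var()
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

# Together, (b0, b1) and (a, b) summarize the calibration-refinement
# factorization of the joint distribution in just four numbers
```

The slope `b1` measures conditional bias (a perfectly calibrated system has `b0 = 0`, `b1 = 1`), while `a` and `b` describe how sharply the forecaster uses the probability scale.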

Five statistical and dynamical tropical cyclone intensity guidance techniques available at the National Hurricane Center (NHC) during the 2003 and 2004 Atlantic and eastern North Pacific seasons were evaluated within three intensity phases: (I) formation; (II) early intensification, with a subcategory (IIa) of a decay and reintensification cycle; and (III) decay. In phase I in the Atlantic, the various techniques tended to predict that a tropical storm would form from six tropical depressions that did not develop further, and thus the tendency was for false alarms in these cases. For the other 24 depressions that did become tropical storms, the statistical-dynamical techniques, statistical hurricane prediction scheme (SHIPS) and decay SHIPS (DSHIPS), have some skill relative to the 5-day statistical hurricane intensity forecast climatology and persistence technique, but they also tend to intensify all depressions and thus are prone to false alarms. In phase II, the statistical-dynamical models SHIPS and DSHIPS do not predict the rapid intensification cases (≥30 kt in 24 h) 48 h in advance. Although the dynamical Geophysical Fluid Dynamics Interpolated model does predict rapid intensification, many of these cases are at the incorrect times with many false alarms. The best performances in forecasting at least 24 h in advance the 21 decay and reintensification cycles in the Atlantic were the three forecasts by the dynamical Geophysical Fluid Dynamics Model-Navy (interpolated) model. Whereas DSHIPS was the best technique in the Atlantic during the decay phase III, none of the techniques excelled in the eastern North Pacific. All techniques tend to decay the tropical cyclones in both basins too slowly, except that DSHIPS performed well (12 of 18) during rapid decay events in the Atlantic. This evaluation indicates where NHC forecasters have deficient guidance and thus where research is necessary for improving intensity forecasts.

The performance of the Climate Prediction Center's long-lead forecasts for the period 1995-98 is assessed through a diagnostic verification, which involves examination of the full joint frequency distributions of the forecasts and the corresponding observations. The most striking results of the verifications are the strong cool and dry biases of the outlooks. These seem clearly related to the 1995-98 period being warmer and wetter than the 1961-90 climatological base period. This bias results in the ranked probability score indicating very low skill for both temperature and precipitation forecasts at all leads. However, the temperature forecasts at all leads, and the precipitation forecasts for leads up to a few months, exhibit very substantial resolution: low (high) forecast probabilities are consistently associated with lower (higher) than average relative frequency of event occurrence, even though these relative frequencies are substantially different (because of the unconditional biases) from the forecast probabilities. Conditional biases, related to systematic under- or overconfidence on the part of the forecasters, are also evident in some circumstances.

A general framework for the problem of absolute verification (AV) is extended to the problem of comparative verification (CV). Absolute verification focuses on the performance of individual forecasting systems (or forecasters), and it is based on the bivariate distributions of forecasts and observations and its two possible factorizations into conditional and marginal distributions. Comparative verification compares the performance of two or more forecasting systems, which may produce forecasts under 1) identical conditions or 2) different conditions. Complexity can be defined in terms of the number of factorizations, the number of basic factors (conditional and marginal distributions) in each factorization, or the total number of basic factors associated with the respective frameworks. Dimensionality is defined as the number of probabilities that must be specified to reconstruct the basic distribution of forecasts and observations. Failure to take account of the complexity and dimensionality of verification problems may lead to an incomplete and inefficient body of verification methodology and, thereby, to erroneous conclusions regarding the absolute and relative quality and/or value of forecasting systems. -from Author
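The dimensionality definition can be made concrete with a small helper (the function name is ours, not the author's):

```python
# Dimensionality of a verification problem: the number of probabilities that
# must be specified to reconstruct the joint distribution of forecasts and
# observations. With K allowable forecast values and J observation values
# there are K*J cells whose probabilities sum to one, hence K*J - 1.
def dimensionality(n_forecast_values, n_observation_values=2):
    return n_forecast_values * n_observation_values - 1

# The classic PoP setting: 11 allowable probabilities (0.0, 0.1, ..., 1.0)
# against a binary observation
d = dimensionality(11)   # -> 21
```

This is the sense in which a full PoP verification is a 21-dimensional problem, and why low-dimensional summaries risk discarding information unless they are chosen with the full framework in mind.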

A new ocean data assimilation and initialization procedure is presented. It was developed to obtain more realistic initial ocean conditions, including the position and structure of the Gulf Stream (GS) and Loop Current (LC), in the Geophysical Fluid Dynamics Laboratory/University of Rhode Island (GFDL/URI) coupled hurricane prediction system used operationally at the National Centers for Environmental Prediction. This procedure is based on a feature-modeling approach that allows a realistic simulation of the cross-frontal temperature, salinity, and velocity of oceanic fronts. While previous feature models used analytical formulas to represent frontal structures, the new procedure uses the innovative method of cross-frontal "sharpening" of the background temperature and salinity fields. The sharpening is guided by observed cross sections obtained in specialized field experiments in the GS. The ocean currents are spun up by integrating the ocean model for 2 days, which was sufficient for the velocity fields to adjust to the strong gradients of temperature and salinity in the main thermocline in the GS and LC. A new feature-modeling approach was also developed for the initialization of a multicurrent system in the Caribbean Sea, which provides the LC source. The initialization procedure is demonstrated for coupled model forecasts of Hurricane Isidore (2002).

Numerical forecasts of heavy warm-season precipitation events are verified using simple composite collection techniques. Various sampling methods and statistical measures are employed to evaluate the general characteristics of the precipitation forecasts. High natural variability is investigated in terms of its effects on the relevance of the resultant statistics. Natural variability decreases the ability of a verification scheme to discriminate between systematic and random error. The effects of natural variability can be mitigated by compositing multiple events with similar properties. However, considerable sample variance is inevitable because of the extreme diversity of mesoscale precipitation structures.
The results indicate that forecasts of heavy precipitation were often correct in that heavy precipitation was observed relatively close to the predicted area. However, many heavy events were missed due in part to the poor prediction of convection. Targeted composites of the missed events indicate that a large percentage of the poor forecasts were dominated by convectively parameterized precipitation. Further results indicate that a systematic northward bias in the predicted precipitation maxima is related to the deficits in the prediction of subsynoptically forced convection.

Ensemble streamflow prediction systems produce forecasts in the form of a conditional probability distribution for a continuous forecast variable. A distributions-oriented approach is presented for verification of these probability distribution forecasts. First, a flow threshold is used to transform the ensemble forecast into a probability forecast for a dichotomous event. The event is said to occur if the observed flow is less than or equal to the threshold; the probability forecast is the probability that the event occurs. The distributions-oriented approach, which has been developed for meteorological forecast verification, is then applied to estimate forecast quality measures for a verification dataset. The results are summarized for thresholds chosen to cover the range of possible flow outcomes. To aid in the comparison for different thresholds, relative measures are used to assess forecast quality. An application with experimental forecasts for the Des Moines River basin illustrates the approach. The application demonstrates the added insights on forecast quality gained through this approach, as compared to more traditional ensemble verification approaches. By examining aspects of forecast quality over the range of possible flow outcomes, the distributions-oriented approach facilitates a diagnostic evaluation of ensemble forecasting systems.
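The threshold step described above can be sketched in a few lines (the function name and the sample member values are illustrative, not from the paper): the ensemble is converted into a probability forecast for the dichotomous event "observed flow at or below the threshold" by taking the fraction of members satisfying the event.

```python
import numpy as np

def ensemble_to_event_probability(ensemble: np.ndarray, threshold: float) -> float:
    """Probability that flow <= threshold, estimated as the fraction of members."""
    return float(np.mean(ensemble <= threshold))

# Hypothetical five-member streamflow ensemble (units arbitrary).
members = np.array([12.0, 8.5, 15.2, 9.9, 11.1])
p = ensemble_to_event_probability(members, threshold=10.0)
print(p)  # 2 of 5 members at or below 10.0 → 0.4
```

Repeating this for a range of thresholds yields a family of dichotomous probability forecasts, each of which can then be verified with the distributions-oriented measures the paper develops.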

This paper explores the relationship between the quality and value of imperfect forecasts. It is assumed that these forecasts are produced by a primitive probabilistic forecasting system and that the decision-making problem of concern is the cost-loss ratio situation. In this context, two parameters describing basic characteristics of the forecasts must be specified in order to determine forecast quality uniquely. As a result, a scalar measure of accuracy such as the Brier score cannot completely and unambiguously describe the quality of the imperfect forecasts. The relationship between forecast accuracy and forecast value is represented by a multivalued function—an accuracy/value envelope. Existence of this envelope implies that the Brier score is an imprecise measure of value and that forecast value can even decrease as forecast accuracy increases (and vice versa). The generality of these results and their implications for verification procedures and practices are discussed.
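The cost-loss ratio situation can be sketched as follows (a standard formulation with illustrative numbers, not the paper's specific model): the decision maker protects whenever the forecast probability reaches the cost/loss ratio, and forecast value is judged by the resulting mean expense.

```python
def expected_expense(probs, events, cost, loss):
    """Mean expense when protecting whenever forecast probability >= cost/loss."""
    total = 0.0
    for p, x in zip(probs, events):
        if p >= cost / loss:
            total += cost          # protect: pay the protection cost
        else:
            total += loss * x      # unprotected: pay the loss if the event occurs
    return total / len(probs)

# Hypothetical probability forecasts and binary outcomes; cost/loss = 0.25.
probs  = [0.9, 0.1, 0.6, 0.2]
events = [1,   0,   1,   1]
print(expected_expense(probs, events, cost=1.0, loss=4.0))  # → 1.5
```

Because the expense depends on how the probabilities fall relative to the cost/loss threshold, two forecast sets with identical Brier scores can produce different expenses, which is the mechanism behind the accuracy/value envelope.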

A general framework for forecast verification based on the joint distribution of forecasts and observations is described, together with its two factorizations: 1) the calibration-refinement factorization, which involves the conditional distributions of observations given forecasts and the marginal distribution of forecasts, and 2) the likelihood-base rate factorization, which involves the conditional distributions of forecasts given observations and the marginal distribution of observations. The names given to the factorizations reflect the fact that they relate to different attributes of the forecasts and/or observations. Some insight into the potential utility of the framework is provided by demonstrating that basic elements and summary measures of the joint, conditional, and marginal distributions play key roles in current verification methods.
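The two factorizations can be sketched directly from a joint distribution table (the 2×2 numbers here are illustrative): both recover the same joint distribution, but each conditions in a different direction.

```python
import numpy as np

# Hypothetical joint distribution p(f, x) for a binary forecast and observation:
# rows index the forecast value, columns the observed value.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.55]])

# Calibration-refinement factorization: p(x | f) and the marginal p(f).
p_f = joint.sum(axis=1)
p_x_given_f = joint / p_f[:, None]

# Likelihood-base rate factorization: p(f | x) and the marginal p(x).
p_x = joint.sum(axis=0)
p_f_given_x = joint / p_x[None, :]

# Both factorizations reconstruct the same joint distribution.
print(np.allclose(p_x_given_f * p_f[:, None], p_f_given_x * p_x[None, :]))  # → True
```

The calibration-refinement view asks "given this forecast, what happens?" while the likelihood-base rate view asks "given this outcome, what was forecast?"; the type I and type II conditional biases discussed in the main paper correspond to these two conditioning directions.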

In order to investigate the effect of tropical cyclone-ocean interaction on the intensity of observed hurricanes, the GFDL movable triply nested mesh hurricane model was coupled with a high-resolution version of the Princeton Ocean Model. The ocean model had 1/6° uniform resolution, which matched the horizontal resolution of the hurricane model in its innermost grid. Experiments were run with and without inclusion of the coupling for two cases of Hurricane Opal (1995) and one case of Hurricane Gilbert (1988) in the Gulf of Mexico and two cases each of Hurricanes Felix (1995) and Fran (1996) in the western Atlantic. The results confirmed the conclusions suggested by the earlier idealized studies that the cooling of the sea surface induced by the tropical cyclone will have a significant impact on the intensity of observed storms, particularly for slow moving storms where the SST decrease is greater. In each of the seven forecasts, the ocean coupling led to substantial improvements in the prediction of storm intensity measured by the storm's minimum sea level pressure. Without the effect of coupling the GFDL model incorrectly forecasted 25-hPa deepening of Gilbert as it moved across the Gulf of Mexico. With the coupling included, the model storm deepened only 10 hPa, which was much closer to the observed amount of 4 hPa. Similarly, during the period that Opal moved very slowly in the southern Gulf of Mexico, the coupled model produced a large SST decrease northwest of the Yucatan and slow deepening consistent with the observations. The uncoupled model using the initial NCEP SSTs predicted rapid deepening of 58 hPa during the same period. Improved intensity prediction was achieved both for Hurricanes Felix and Fran in the western Atlantic. For the case of Hurricane Fran, the coarse resolution of the NCEP SST analysis could not resolve Hurricane Edouard's wake, which was produced when Edouard moved in nearly an identical path to Fran four days earlier.
As a result, the operational GFDL forecast using the operational SSTs and without coupling incorrectly forecasted 40-hPa deepening while Fran remained at nearly constant intensity as it crossed the wake. When the coupled model was run with Edouard's cold wake generated by imposing hurricane wind forcing during the ocean initialization, the intensity prediction was significantly improved. The model also correctly predicted the rapid deepening that occurred as Fran began to move away from the cold wake. These results suggest the importance of an accurate initial SST analysis as well as the inclusion of the ocean coupling, for accurate hurricane intensity prediction with a dynamical model. Recently, the GFDL hurricane-ocean coupled model used in these case studies was run on 163 forecasts during the 1995-98 seasons. Improved intensity forecasts were again achieved with the mean absolute error in the forecast of central pressure reduced by about 26% compared to the operational GFDL model. During the 1998 season, when the system was run in near-real time, the coupled model improved the intensity forecasts for all storms with central pressure higher than 940 hPa although the most significant improvement (~60%) occurred in the intensity range of 960-970 hPa. These much larger sample sets confirmed the conclusion from the case studies, that the hurricane-ocean interaction is an important physical mechanism in the intensity of observed tropical cyclones.

Skill scores defined as measures of relative mean square error, based on standards of reference representing climatology, persistence, or a linear combination of climatology and persistence, are decomposed. Two decompositions of each skill score are formulated: 1) a decomposition derived by conditioning on the forecasts and 2) a decomposition derived by conditioning on the observations. These general decompositions contain terms consisting of measures of statistical characteristics of the forecasts and/or observations and terms consisting of measures of basic aspects of forecast quality. Properties of the terms in the respective decompositions are examined, and relationships among the various skill scores, and the terms in the respective decompositions, are described. Hypothetical samples of binary forecasts and observations are used to illustrate the application and interpretation of these decompositions. Limitations on the inferences that can be drawn from comparative verification based on skill scores, as well as from comparisons based on the terms in decompositions of skill scores, are discussed. The relationship between the application of measures of aspects of quality and the application of the sufficiency relation (a statistical relation that embodies the concept of unambiguous superiority) is briefly explored. The following results can be gleaned from this methodological study: 1) decompositions of skill scores provide quantitative measures of, and insights into, multiple aspects of the forecasts, the observations, and their relationship; 2) superiority in terms of overall skill is no guarantor of superiority in terms of other aspects of quality; and 3) sufficiency (i.e., unambiguous superiority) generally cannot be inferred solely on the basis of superiority over a relatively small set of measures of specific aspects of quality.
Neither individual measures of overall performance (e.g., skill scores) nor sets of measures associated with decompositions of such overall measures respect the dimensionality of most verification problems. Nevertheless, the decompositions described here identify parsimonious sets of measures of basic aspects of forecast quality that should prove to be useful in many verification problems encountered in the real world.
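The climatology-referenced skill score underlying these decompositions can be sketched as follows (a minimal version; the decomposition terms themselves are omitted, and the function name is my own):

```python
import numpy as np

def mse_skill_score(forecasts, observations):
    """Skill relative to climatology: the observed mean used as a constant forecast."""
    f = np.asarray(forecasts, dtype=float)
    x = np.asarray(observations, dtype=float)
    mse_forecast = np.mean((f - x) ** 2)
    mse_climatology = np.mean((x.mean() - x) ** 2)
    return 1.0 - mse_forecast / mse_climatology

print(mse_skill_score([1, 2, 3, 4], [1, 2, 3, 4]))  # a perfect forecast scores 1.0
```

Note that a system that always issues the climatological mean scores exactly zero; the decompositions in the paper show how the remaining skill splits into correlation, conditional-bias, and unconditional-bias contributions, which a single score like this one conceals.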

This paper presents a framework for quantifying predictability based on the behavior of imperfect forecasts. The critical quantity in this framework is not the forecast distribution, as used in many other predictability studies, but the conditional distribution of the state given the forecasts, called the regression forecast distribution. The average predictability of the regression forecast distribution is given by a quantity called the mutual information. Standard inequalities in information theory show that this quantity is bounded above by the average predictability of the true system and by the average predictability of the forecast system. These bounds clarify the role of potential predictability, of which many incorrect statements can be found in the literature. Mutual information has further attractive properties: it is invariant with respect to nonlinear transformations of the data, cannot be improved by manipulating the forecast, and reduces to familiar measures of correlation skill when the forecast and verification are joint normally distributed. The concept of potential predictable components is shown to define a lower-dimensional space that captures the full predictability of the regression forecast without loss of generality. The predictability of stationary, Gaussian, Markov systems is examined in detail. Some simple numerical examples suggest that imperfect forecasts are not always useful for joint normally distributed systems since greater predictability often can be obtained directly from observations. Rather, the usefulness of imperfect forecasts appears to lie in the fact that they can identify potential predictable components and capture nonstationary and/or nonlinear behavior, which are difficult to capture by low-dimensional, empirical models estimated from short historical records.
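For discrete forecast and state variables, the mutual information used in this framework can be computed directly from the joint distribution (a minimal sketch; the 2×2 examples are illustrative):

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information (in bits) of a discrete joint distribution p(f, x)."""
    p_f = joint.sum(axis=1, keepdims=True)   # marginal over the state
    p_x = joint.sum(axis=0, keepdims=True)   # marginal over the forecast
    mask = joint > 0                         # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_f @ p_x)[mask])))

# Independent forecast and state: no predictability.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # → 0.0
# Perfectly informative forecast of a fair binary state: 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # → 1.0
```

The invariance property mentioned above follows because relabeling forecast or state categories merely permutes the rows or columns of the joint table, leaving the sum unchanged.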

A new vector partition of the probability, or Brier, score (PS) is formulated, and the nature and properties of this partition are described. The relationships between the terms in this partition and the terms in the original vector partition of the PS are indicated. The new partition consists of three terms: 1) a measure of the uncertainty inherent in the events, or states, on the occasions of concern (namely, the PS for the sample relative frequencies); 2) a measure of the reliability of the forecasts; and 3) a new measure of the resolution of the forecasts. The new measure of reliability is equivalent (i.e., linearly related) to the measure of reliability provided by the original partition, whereas the new measure of resolution is not. Two sample collections of probability forecasts are used to illustrate the differences and relationships between these partitions. Finally, the two partitions are compared, with particular reference to the attributes of the forecasts with which the partitions are concerned, the interpretation of the partitions in geometric terms, and the use of the partitions as the bases for the formulation of measures to evaluate probability forecasts. The results of these comparisons indicate that the new partition offers certain advantages vis-à-vis the original partition.
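The three-term partition (reliability − resolution + uncertainty) can be sketched directly; the sample forecasts below are illustrative, and forecasts are grouped by their distinct probability values as the partition requires:

```python
import numpy as np

def brier_partition(probs, events):
    """Three-term Brier score partition: BS = reliability - resolution + uncertainty."""
    probs = np.asarray(probs, dtype=float)
    events = np.asarray(events, dtype=float)
    xbar = events.mean()                      # sample base rate
    uncertainty = xbar * (1.0 - xbar)
    reliability = 0.0
    resolution = 0.0
    n = len(probs)
    for p in np.unique(probs):                # group occasions by forecast value
        sel = probs == p
        nk = sel.sum()
        xk = events[sel].mean()               # conditional relative frequency
        reliability += nk / n * (p - xk) ** 2
        resolution += nk / n * (xk - xbar) ** 2
    return reliability, resolution, uncertainty

probs  = [0.1, 0.1, 0.8, 0.8, 0.5]
events = [0,   0,   1,   1,   1]
rel, res, unc = brier_partition(probs, events)
bs = np.mean((np.asarray(probs) - np.asarray(events)) ** 2)
print(np.isclose(bs, rel - res + unc))  # the partition identity holds → True
```

Reliability (smaller is better) penalizes mismatch between forecast probabilities and conditional event frequencies, while resolution (larger is better) rewards forecasts whose conditional frequencies differ from the base rate, which is exactly the distinction between calibration and discrimination exploited in distributions-oriented verification.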

The influence of various environmental factors on tropical cyclone intensity is explored using a simple coupled ocean atmosphere model. It is first demonstrated that this model is capable of accurately replicating the intensity evolution of storms that move over oceans whose upper thermal structure is not far from monthly mean climatology and that are relatively unaffected by environmental wind shear. A parameterization of the effects of environmental wind shear is then developed and shown to work reasonably well in several cases for which the magnitude of the shear is relatively well known. When used for real-time forecasting guidance, the model is shown to perform better than other existing numerical models while being competitive with statistical methods. In the context of a limited number of case studies, the model is used to explore the sensitivity of storm intensity to its initialization and to a number of environmental factors, including potential intensity, storm track, wind shear, upper-ocean thermal structure, bathymetry, and land surface characteristics. All of these factors are shown to influence storm intensity, with their relative contributions varying greatly in space and time. It is argued that, in most cases, the greatest source of uncertainty in forecasts of storm intensity is uncertainty in forecast values of the environmental wind shear, the presence of which also reduces the inherent predictability of storm intensity.

Updates to the Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin are described. SHIPS combines climatological, persistence, and synoptic predictors to forecast intensity changes using a multiple regression technique. The original version of the model was developed for the Atlantic basin and was run in near-real time at the Hurricane Research Division beginning in 1993. In 1996, the model was incorporated into the National Hurricane Center operational forecast cycle, and a version was developed for the eastern North Pacific basin. Analysis of the forecast errors for the period 1993-96 shows that SHIPS had little skill relative to forecasts based upon climatology and persistence. However, SHIPS had significant skill in both the Atlantic and east Pacific basins during the 1997 hurricane season. The regression coefficients for SHIPS were rederived after each hurricane season since 1993 so that the previous season's forecast cases were included in the sample. Modifications to the model itself were also made after each season. Prior to the 1997 season, the synoptic predictors were determined only from an analysis at the beginning of the forecast period. Thus, SHIPS could be considered a "statistical-synoptic" model. For the 1997 season, methods were developed to remove the tropical cyclone circulation from the global model analyses and to include synoptic predictors from forecast fields, so the current version of SHIPS is a "statistical-dynamical" model. It was only after the modifications for 1997 that the model showed significant intensity forecast skill.
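The multiple-regression core of such a scheme can be sketched in a few lines. The predictor names and all numbers below are purely illustrative placeholders, not SHIPS coefficients or data:

```python
import numpy as np

# Hypothetical predictors (columns): persistence (past 12-h intensity change),
# potential-intensity difference, and vertical wind shear. Values are invented.
X = np.array([[ 5.0, 30.0,  5.0],
              [-3.0, 10.0, 20.0],
              [ 8.0, 40.0,  8.0],
              [ 0.0, 20.0, 15.0]])
y = np.array([10.0, -5.0, 15.0, 2.0])   # "observed" 24-h intensity change (kt)

# Fit intensity change as a linear combination of the predictors plus an intercept,
# by ordinary least squares.
A = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
prediction = A @ coeffs
```

In an operational setting the coefficients would be rederived each season on the accumulated developmental sample, which is the refitting cycle the abstract describes.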


Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 13-36.

Déqué, M., 2003: Continuous variables. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 97-119.

Jolliffe, I. T., and D. B. Stephenson, Eds., 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley, 240 pp.