# Global Mosquito Alert Models

## Model Overview

These models provide estimates of the probability of reports being sent through Mosquito Alert during a given month from each area shown on the map, assuming full sampling effort everywhere. In the case of mosquito bite reports, the estimates are of the probability of at least one such report being sent during the given month. In the case of the targeted adult mosquito reports (*Ae. albopictus*, *Ae. koreicus*, *Ae. japonicus*, or *Culex*), the estimates are of the probability of at least one report being sent during the given month that is classified by the digital entolab experts as possibly or probably representing the given target species/genus (i.e. a score of 1 or 2).

*Note that the Culex model is still being refined. Although it is described here, the results are not yet being displayed on the public webmap. We hope to have them there soon.*

These estimates should be roughly correlated with the probability of a person being bitten by a mosquito or of encountering an adult mosquito of each given target species/genus. Note, however, that the probabilities faced by a person are not what is being directly modeled. Instead, the models start with a set of spatio-temporal units (see more on this below) and estimate the probabilities of reports being received from these units. The models are similar to the ones described in Palmer et al. (2017), but now scaled up beyond Spain and with some changes in the units of analysis and covariates used.

These model estimates are displayed on the Mosquito Alert public webmap at https://map.mosquitoalert.com/spa/models. All of the model estimates that appear on the map can be accessed in CSV format on GitHub at https://github.com/Mosquito-Alert/global_model_estimates.

## Units of analysis

The units of analysis in these models are areal-unit-months, with the areal units defined by level 4 of the Global Administrative Areas Database (GADM) (Global Administrative Areas 2022; see Brigham, Gilbert, and Xu 2011), with slight modifications. GADM level 4 corresponds to Eurostat’s Local Administrative Units Level 2, which are municipalities in most of the countries included in the model. For those countries for which GADM level 4 is not available (e.g. Andorra), the next lowest GADM level is used instead. (GADM is used here instead of Eurostat’s NUTS or LAU because these models will soon be expanded beyond Europe.)

Model estimates are shown on a map that can be zoomed out to reveal GADM levels 3, 2, and 1 as well. For these units, estimates are made by aggregating probabilities such that the interpretation remains the same: The estimate at each level is of the probability of at least one report being sent from the given unit. (Specifically, we aggregate units by taking the complement of the product of probabilities of no reports being sent, after first rescaling to keep sampling effort constant.)
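The aggregation rule described above can be illustrated with a minimal Python sketch. It assumes that reports from the child units are independent and it omits the rescaling step that keeps sampling effort constant, since that step is not specified in detail here. The function name and the example probabilities are hypothetical.

```python
from math import prod


def aggregate_probability(child_probs):
    """Probability of at least one report from a parent unit.

    Takes the complement of the product of the child units'
    probabilities of sending no report (independence assumed).
    """
    return 1.0 - prod(1.0 - p for p in child_probs)


# Three hypothetical GADM 4 units nested in one GADM 3 unit.
p_parent = aggregate_probability([0.10, 0.20, 0.05])
print(round(p_parent, 3))
```

Because the result is again a probability of at least one report, the same function could be applied repeatedly to move from GADM 3 to GADM 2 and GADM 1.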

## Model Specification and Covariates

The models are Bayesian multilevel logistic regressions estimated using Stan (Stan Development Team 2022) via the brms interface for R (Bürkner 2021; R Core Team 2022). The log odds of at least one report (as explained above) is modeled as a function of a set of land cover variables, a set of weather variables, sampling effort, and area, with random intercepts at the GADM 1 level. (GADM 1 corresponds to NUTS 2 in Europe. In Spain, for example, these are the autonomous communities.)

The covariates used in the models are defined as follows:

- SE: Sampling effort. This is estimated from Mosquito Alert’s optional background tracking module, which provides approximately 5 locations per day, at random times, for each participant who has not opted out of it. All locations are masked to a grid of 0.025 degrees latitude and longitude before being transmitted from the participant’s device to the server. Participants’ propensity to send any report is estimated, based on how long each participant has had the app, as the discrete empirical hazard from the reporting data. SE depends on the number of participants in each sampling-cell-month and on each of these participants’ reporting propensity, with propensities aggregated such that the SE value can be interpreted as the probability of at least one report (of any type, valid or not valid) coming from the given spatio-temporal unit. For more on Mosquito Alert sampling effort, see Palmer et al. (2017) and Bartumeus, Oltra, and Palmer (2018).
- TEMP: Temperature at 2 meters above the surface in °C. From ERA5-Land monthly averaged data from 1950 to present (Muñoz Sabater 2019; Copernicus Climate Change Service 2022).
- RH: Relative humidity. From ERA5-Land monthly averaged data from 1950 to present (Muñoz Sabater 2019; Copernicus Climate Change Service 2022).
- W: Windspeed in meters per second. From ERA5-Land monthly averaged data from 1950 to present (Muñoz Sabater 2019; Copernicus Climate Change Service 2022).
- DUF: Proportion of the areal unit covered by discontinuous urban fabric. Code 1.1.2 from the 2018 CORINE Land Cover dataset (Büttner et al. 2021).
- CUF: Proportion of the areal unit covered by continuous urban fabric. Code 1.1.1 from the 2018 CORINE Land Cover dataset (Büttner et al. 2021).
- GUA: Proportion of the areal unit covered by green urban area. Code 1.4.1 from the 2018 CORINE Land Cover dataset (Büttner et al. 2021).
- FOR: Proportion of the areal unit covered by forests, shrubs, and/or herbaceous vegetation. Codes 3.1 and 3.2 from the 2018 CORINE Land Cover dataset (Büttner et al. 2021).
- AGR: Proportion of the areal unit covered by agricultural areas. Codes 2.1, 2.2, 2.3, and 2.4 from the 2018 CORINE Land Cover dataset (Büttner et al. 2021).
- CG: Country Group. This is a categorical variable that makes it possible to control for country-specific reporting behavior among the three countries from which the most reports have been sent: Spain, Italy, and the Netherlands. Each of these is its own category; a fourth category is used for all other countries combined.
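As an illustration of how the SE covariate described above could be assembled, the following Python sketch masks location fixes to a 0.025 degree grid and aggregates participants’ reporting propensities within each cell for a single month. It is a simplified sketch, not the Mosquito Alert implementation: the flooring convention for the grid, the function names, and all input values are assumptions, and the step that maps cells to GADM units is left out.

```python
import math
from collections import defaultdict

CELL_SIZE = 0.025  # grid resolution in degrees, as stated in the text


def cell_index(lat, lon, cell=CELL_SIZE):
    """Integer grid-cell index of a coordinate (flooring is an assumed convention)."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def sampling_effort_by_cell(fixes, propensity):
    """SE per cell for one month: probability of at least one report of any type.

    fixes:      iterable of (participant_id, lat, lon) background locations
    propensity: dict mapping participant_id to that participant's
                reporting propensity (a probability)
    """
    participants = defaultdict(set)
    for pid, lat, lon in fixes:
        participants[cell_index(lat, lon)].add(pid)  # count each participant once per cell
    return {
        cell: 1.0 - math.prod(1.0 - propensity[pid] for pid in pids)
        for cell, pids in participants.items()
    }


# Hypothetical fixes and propensities, for illustration only.
fixes = [
    ("a", 41.3879, 2.1699),
    ("a", 41.3881, 2.1701),
    ("b", 41.3890, 2.1710),
    ("b", 41.4600, 2.2100),
]
propensity = {"a": 0.02, "b": 0.05}
se = sampling_effort_by_cell(fixes, propensity)
```

In this example the first three fixes fall in the same cell, so that cell's SE combines the propensities of both participants, while the last fix contributes only participant "b" to a second cell.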

If we define the probability of at least one target report coming from GADM 4 unit *i* within GADM 1 unit *j* in country group *k* during month *t* as \pi_{ijkt}, then we can write the models for each of the three targeted *Aedes* species (*Ae. albopictus*, *Ae. koreicus*, and *Ae. japonicus*) as:

\begin{aligned} \textrm{log}\left( \frac{\pi_{ijkt}}{1-\pi_{ijkt}}\right) = & \,\textrm{log}(\textrm{SE}_{i}) + \textrm{log}(\textrm{area}_{i}) + \alpha_{1j} + \alpha_{2k} + \beta_1\textrm{TEMP}_{it} + \beta_{2}\textrm{TEMP}_{it}^2 + \beta_{3}\textrm{RH}_{it} + \\ & \, \beta_{4}\textrm{W}_{it} + \beta_{5}\textrm{DUF}_{i} + \beta_{6}\textrm{CUF}_{i} + \beta_{7}\textrm{FOR}_{i} + \beta_{8}\textrm{AGR}_{i} \\ \end{aligned}

All of the abbreviations in this equation are as defined above. Sampling effort (SE) and area enter these models as offsets, so results can be interpreted as probabilities per unit of sampling effort and area. The parameters \alpha_{1j} are random intercepts capturing variation at the GADM 1 level, and the parameters \alpha_{2k} are the intercepts for the country groups (CG). The \beta terms are coefficients on the remaining covariates.
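To make the role of the offsets concrete, here is a minimal Python sketch of how the linear predictor of the *Aedes* models maps to a probability. The function name, the dictionary keys, and every numeric value are hypothetical and chosen only for illustration; the actual models are fitted in Stan via brms, not with this code.

```python
import math


def aedes_probability(se, area, alpha_gadm1, alpha_group, beta, covariates):
    """Inverse logit of the linear predictor written above.

    log(se) and log(area) are offsets: they enter with a fixed
    coefficient of 1 instead of an estimated one.
    """
    x = dict(covariates)
    x["TEMP2"] = x["TEMP"] ** 2  # quadratic temperature term
    eta = math.log(se) + math.log(area) + alpha_gadm1 + alpha_group
    eta += sum(beta[name] * x[name] for name in beta)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients and covariates (not fitted estimates).
beta = {"TEMP": 0.3, "TEMP2": -0.008, "RH": 0.01, "W": -0.1,
        "DUF": 1.0, "CUF": 0.5, "FOR": -0.2, "AGR": -0.3}
covariates = {"TEMP": 22.0, "RH": 60.0, "W": 3.0,
              "DUF": 0.10, "CUF": 0.05, "FOR": 0.30, "AGR": 0.40}

p_low_effort = aedes_probability(0.2, 1.0, -6.0, 0.0, beta, covariates)
p_full_effort = aedes_probability(1.0, 1.0, -6.0, 0.0, beta, covariates)
```

Holding everything else fixed, raising SE from 0.2 to 1 shifts the log odds by log(1) - log(0.2), which is how setting SE to 1 yields estimates under full sampling effort.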

Taking these same terms but with small changes in the covariates used, we can write the bites model as:

\begin{aligned} \textrm{log}\left( \frac{\pi_{ijkt}}{1-\pi_{ijkt}}\right) = & \, \textrm{log}(\textrm{SE}_{i}) + \textrm{log}(\textrm{area}_{i}) + \alpha_{1j} + \alpha_{2k} + \beta_1\textrm{TEMP}_{it} + \beta_{2}\textrm{TEMP}_{it}^2 + \beta_{3}\textrm{RH}_{it} + \\ & \,\beta_{4}\textrm{log(DUF)}_{i} + \beta_{5}\textrm{log(GUA)}_{i} + \beta_{6}\textrm{log(AGR)}_{i} \\ \end{aligned}

Finally, the *Culex* model (which is somewhat different from the others in that it does not include land cover) can be written as:

\begin{aligned} \textrm{log}\left( \frac{\pi_{ijkt}}{1-\pi_{ijkt}} \right) = & \, \textrm{log}(\textrm{SE}_{i}) + \textrm{log}(\textrm{area}_{i}) + \alpha_{1j} + \beta_1\textrm{TEMP}_{it} + \beta_{2}\textrm{TEMP}_{it}^2 + \beta_{3}\textrm{RH}_{it} + \\ & \, \beta_{4}\textrm{W}_{it} \\ \end{aligned}

## Spatial and Temporal Extents

Currently, all model estimates are limited to the spatial extent of the CORINE Land Cover dataset, which includes all members of the European Environment Agency as well as cooperating countries (Büttner 2014). Geographic coverage will be expanded in future models. Within that spatial extent, the models for adult mosquito targets (*Ae. albopictus*, *Ae. koreicus*, *Ae. japonicus*, and *Culex*) are estimated using only data from GADM 4 units in which the target species is listed by the ECDC as “established” or “introduced” or from which validated Mosquito Alert reports have been received. The bite model is not limited in this way.

The *Ae. albopictus* models include data from June 2014 to present, while the others include data from January 2021 to present. (Using January 2021 as the starting point ignores some data on the other target species submitted before that date but avoids the problem of how to interpret and control for sampling effort for these targets in those earlier periods, when Mosquito Alert was not yet, or had only just begun, asking people to report these species.) Finally, the models include only GADM-4-months from which at least some sampling effort was detected.
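The temporal rules above amount to a simple filter on GADM-4-months. The Python sketch below shows one way to express it; the function name and the target label strings are hypothetical, and the ECDC-based spatial restriction is deliberately left out.

```python
from datetime import date

ALBOPICTUS_START = date(2014, 6, 1)  # June 2014, from the text
OTHER_START = date(2021, 1, 1)       # January 2021, from the text


def in_estimation_sample(target, month, sampling_effort):
    """True if a GADM-4-month enters model estimation for the given target.

    month is the first day of the calendar month; sampling_effort is the
    SE value detected for that unit-month.
    """
    start = ALBOPICTUS_START if target == "albopictus" else OTHER_START
    return month >= start and sampling_effort > 0


print(in_estimation_sample("albopictus", date(2019, 7, 1), 0.4))
print(in_estimation_sample("koreicus", date(2019, 7, 1), 0.4))
```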

## Model Predictions

Model predictions are made within the outer spatial and temporal extents defined above for all GADM-4 units in which the target species is listed by the ECDC as “established” or “introduced” or from which validated Mosquito Alert reports have been received. Predictions are not limited based on sampling effort. Instead, the predictions are made as if sampling effort were equal to 1 in all of these GADM-4-months. GADM-4 units for which ECDC lists the target as “absent” and from which no valid Mosquito Alert reports have been received are assigned predicted values of 0. All other GADM-4 units are left without predicted values.
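The three cases described above can be summarized in a short Python sketch. The function name, the lowercase status strings, and the "no data" status used in the example are assumptions made for illustration; `model_probability` stands for a model prediction already computed with sampling effort set to 1.

```python
def prediction_for_unit(ecdc_status, has_valid_reports, model_probability):
    """Predicted value assigned to one GADM-4 unit for one target.

    Returns the model prediction, 0.0, or None (no predicted value).
    """
    if ecdc_status in {"established", "introduced"} or has_valid_reports:
        return model_probability  # unit is inside the prediction set
    if ecdc_status == "absent":
        return 0.0  # listed as absent and no valid reports received
    return None  # all other units are left without a value


print(prediction_for_unit("established", False, 0.42))
print(prediction_for_unit("absent", False, 0.42))
print(prediction_for_unit("no data", False, 0.42))
```

Note that a unit listed as “absent” still receives the model prediction if valid reports have been received from it, because the report check comes first.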

Model predictions are made ignoring the country-group random intercepts, as these intercepts are expected to capture differences in sampling effort not sufficiently captured by the SE variable itself. In contrast, GADM-1-level random intercepts are used in the predictions in order to incorporate information about the observed geographic distribution.

Finally, predictions from the *Culex* model are forced to zero for the winter months because the models currently pick up unrealistic estimates during winter. For all other models, the predicted values naturally drop during winter months.

The following plots show model estimates at the GADM-4 level for each month during 2022 for each of the targets. The final plot then shows the temporal patterns of the estimates from January 2021 to present.