Getting Started

1. Access metadata

The Mosquito Alert Data Portal can be accessed from the Mosquito Alert webpage. In the main menu, hover over Open Data, choose Mosquito Alert Data Portal and click on the link to the data portal.

The Accessibility section provides a summary table of the available datasets. The table headers are:

  • Dataset: name of the dataset or data catalog (i.e. a collection of datasets)

  • Project: name of the project, or a general label used to group similar datasets

  • Description: short description of the dataset

  • License: specifies the type of license and the conditions of access (i.e. public or private)

  • Example: availability of a code-example for a given dataset

  • Format: specifies the dataset file format

This table provides a general overview of the datasets. However, to access an exhaustive description of a dataset (i.e. its metadata), just click on the name of the dataset of interest. Another way to access the metadata is to navigate within the left-side contents menu. The attributes in the metadata table of a given dataset are defined by the Schema.org vocabulary. Some attributes (i.e. distribution, variableMeasured, measurementTechnique and creator) are hidden and must be expanded to show their contents.

2. How to get the data

Each dataset has a distribution attribute that provides a list of possible ways to access the dataset (i.e. DataDownload type). For example, the reports dataset can be accessed in two different ways:

  • Distribution from Zenodo cloud (i.e. zenodo)

  • Distribution from MosquitoAlert GitHub repository (i.e. mosquitoalert_github)

Both distributions give access to the same dataset, but the download and data file read procedures are slightly different. Note that the encodingFormat of the zenodo distribution is [JSON].ZIP. This means that the dataset is stored as a set (i.e. […]) of .json files compressed into one .zip file. On the other hand, mosquitoalert_github provides a direct download for each .json file. Thus, the reports dataset downloaded from zenodo must be decompressed, and there is no way to download only the reports for a given year. For the latter, the mosquitoalert_github distribution can be used instead since, as detailed below, it provides a direct download link to each yearly data file.
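As a purely illustrative sketch, assuming the compressed archive has already been downloaded from Zenodo and saved locally under a hypothetical name such as all_reports.zip, it could be decompressed and read with the Python standard library as follows:

import json
import zipfile

# Hypothetical local path of the [JSON].ZIP distribution downloaded from Zenodo
zip_path = "all_reports.zip"

# Collect the records of every .json file contained in the archive
# (assuming each file stores a JSON array of reports)
reports = []
with zipfile.ZipFile(zip_path) as archive:
    for member in archive.namelist():
        if member.endswith(".json"):
            with archive.open(member) as fh:
                reports.extend(json.load(fh))

print(len(reports))  # total number of records across all yearly files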

The contentUrl attribute provides the download link for a dataset. In the case of the zenodo distribution of the reports dataset, this is just a DOI link generated by Zenodo when the dataset was first published to the repository. Open this link in a new browser tab to access the Zenodo website where the dataset is stored. From there, it is possible to download a given version of the dataset or just the most recent one.

In contrast, the contentUrl of the mosquitoalert_github distribution has a slightly different format compared with the zenodo one. It is a list of two URLs, one for the yearly report files and another for a table of language translations. Note that the {YEAR} placeholder in the first URL should be substituted with a string corresponding to the year of interest. For example, in Python this substitution can be performed simply as

year = "2015"
url = "https://url_path/all_reports{YEAR}.json".format(YEAR=year)
print(url)
# "https://url_path/all_reports2015.json"

Another example of how to use a contentUrl with a {...} placeholder is the tigapics_mosquitoalert dataset from the tigapics catalog (see DataCatalog structure): if the photo ID is known (see the photos_clean dataset), it can be substituted into the URL to download a single photo.
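A purely illustrative sketch of this substitution (the URL pattern below is not the actual contentUrl of tigapics_mosquitoalert):

# Illustrative only: substitute a known photo ID (see the photos_clean dataset)
# into the {...} placeholder of the contentUrl
photo_id = "some_photo_id"
photo_url = "https://url_path/photos/{ID}".format(ID=photo_id)
print(photo_url)
# "https://url_path/photos/some_photo_id"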

3. Dataset variables

Dataset variables are described in the variableMeasured metadata attribute. For each variable, a description and a data type (i.e. qudt:dataType) are provided. Data types follow the XML Schema Definition standard. Optionally, unit of measurement information (i.e. the unitText attribute) is also provided for physical quantities.
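As an illustration, a single variable entry could be represented in Python roughly as follows (the names and values are hypothetical, not taken from an actual dataset):

# Hypothetical variableMeasured entry following the Schema.org vocabulary
variable_entry = {
    "@type": "PropertyValue",
    "name": "temperature",          # variable name as it appears in the data file
    "description": "Air temperature at the sampling site",
    "qudt:dataType": "xsd:float",   # data type following the XML Schema Definition standard
    "unitText": "degree Celsius",   # unit of measurement, given only for physical quantities
}
print(variable_entry["qudt:dataType"])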

4. Python code-examples

Python scripts (i.e. Jupyter notebooks) give examples of how to programmatically access and read the datasets. The list of available code-examples is provided in the dataset summary table of the Accessibility section, under the Example header. For a given dataset, a code-example is provided for each distribution. The workExample attribute just gives a relative path to the code-example, which is mainly used for website development purposes.

To run all the examples, one needs to set up a Python environment on the local machine (see subsection Local Python environment). However, it is also possible to run the examples for Public datasets on the fly, without any local installation of Python, since we use MyBinder, an online service for building and sharing reproducible and interactive computational environments (see section MyBinder online service).

4.1 MyBinder online service

The MyBinder service can be used only for Public dataset access since, for security reasons, it does not allow fetching data from FTP sites, and it is not advisable to type passwords into a running Binder session (e.g. to access another machine by SSH).

To start a Binder session, follow these steps:

  1. Start a Binder interactive session: Binder

    Note

    Sometimes the server may be busy or down for maintenance; in that case, retry to enter later.

  2. Wait for Binder to start up (a few seconds) or to build the image (a few minutes)

  3. Once the Binder session starts, navigate in the left-side menu to the _sources/notebooks folder where the code-example files (.ipynb) are stored

  4. Open a code-example (e.g. reports.ipynb) and choose the Python kernel

  5. Execute the code cells in order and edit or modify the code if necessary

    Note

    It is not possible to save edited code in a Binder session, thus if needed export the whole file.

4.2 Local Python environment

One way to set up a local Python environment that runs the code-examples is with the Conda package and environment manager. Conda uses the same configuration file (environment.yml) as MyBinder, thus both environments are very similar.

  1. Install Miniconda

  2. Download the Data Portal webpage from the GitHub repository

  3. Build a new conda virtual environment from the environment.yml configuration file, which can be found in the root directory of the downloaded webpage:

    $ conda env create --name ma_data_portal --file environment.yml
    
  4. Install Jupyter Lab to run the code-examples stored in the _sources/notebooks/ folder as notebooks:

    $ conda install -c conda-forge jupyterlab 
    
  5. Launch the Jupyter Lab session:

    $ jupyter lab
    
  6. Choose the ma_data_portal kernel and run the code-examples in _sources/notebooks

5. DataCatalog structure

A data catalog is just a list of datasets described by a hasPart attribute. A catalog makes it possible to avoid metadata redundancy and to group similar datasets together. As an example, the tigapics catalog contains datasets of photos generated by the Mosquito Alert application that have common attributes (i.e. license, citation, measurementTechnique and creator). Only temporalCoverage and spatialCoverage are specified for each dataset, since they may differ. The temporal and spatial coverage attributes given at the head of the data catalog give general information about the overall coverage.
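
As a rough sketch, the overall structure of such a catalog could be represented in Python as nested dictionaries (the values below are illustrative, not copied from the actual metadata):

# Illustrative DataCatalog skeleton: attributes shared by all datasets sit at the
# catalog level, per-dataset attributes (temporal/spatial coverage) inside hasPart
tigapics_catalog = {
    "@type": "DataCatalog",
    "name": "tigapics",
    "license": "...",                # shared by all datasets of the catalog
    "measurementTechnique": "...",   # shared
    "creator": "...",                # shared
    "hasPart": [
        {
            "@type": "Dataset",
            "name": "tigapics_mosquitoalert",
            "temporalCoverage": "...",   # specific to this dataset
            "spatialCoverage": "...",    # specific to this dataset
        },
        # ... more photo datasets
    ],
}
print([dataset["name"] for dataset in tigapics_catalog["hasPart"]])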