Dataset: senscape_traps

import requests
import pandas as pd

import src.utils as ut
from src import senscape

# Set up the root path of the application
project_path = ut.project_path()

# Load the metadata

meta_filename = [
    f"{ut.project_path(1)}/meta/mosquito_alert/irideon_senscape_traps.json",
    f"{ut.project_path(2)}/meta_ipynb/irideon_senscape_traps.html",
]
metadata = ut.load_metadata(meta_filename)

# Show contentUrl and other info from the metadata file
ut.info_meta(metadata)

1. Distribution by API

Before you can request data from the irideon_senscape_traps dataset, you need an API key for access. There are two options: run a simple script that downloads the data with serial requests (see Section 1.1), or use an ad-hoc library that sends requests in parallel (see Section 1.2).

API_KEY = input("Enter Senscape API key: ")
headers = {"Authorization": API_KEY}
# Get metadata
contentUrl, dataset_name, distr_name = ut.get_meta(
    metadata, idx_distribution=0, idx_hasPart=None
)

# Make folders for data download
path = f"{project_path}/data/{dataset_name}/{distr_name}"
ut.makedirs(path)
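
Both download options below authenticate every request with the same Authorization header. As an optional aside, a requests.Session can attach the header automatically and reuse the underlying connection; a minimal sketch, not used in the rest of this notebook:

# Optional: a Session sends the Authorization header on every call and
# reuses the TCP connection, so headers=... need not be repeated
session = requests.Session()
session.headers.update({"Authorization": API_KEY})
# Example: session.get(f"{contentUrl}?pageSize=5") is equivalent to
# requests.get(f"{contentUrl}?pageSize=5", headers=headers)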

1.1 Get dataset with simple script

This is a simple example script showing how to download the complete dataset and how to update the CSV file on demand. First, get just the first five samples, build a pandas DataFrame, and store it as CSV

# Request just the first 5 records from the API
url = contentUrl
response = requests.get(
    f"{url}?sortOrder=asc&sortField=record_time&pageSize=5", headers=headers
)
response.raise_for_status()  # Fail early on HTTP errors
df = pd.DataFrame(response.json()["samples"])
# Save the samples as CSV
filename = f"{path}/dataset"
df.to_csv(f"{filename}.csv")

At this point, load the dataset from CSV, get the record time of the last sample, download all newer samples page by page, append them to the DataFrame, and store the result as CSV

# Read CSV
df = pd.read_csv(f"{filename}.csv", index_col=0)
df.head()

# Sort ascending by record time so the last row holds the newest sample
df = df.sort_values(by="record_time")

# Get last record time
last_record_time = df.iloc[-1]["record_time"]
# Example dateTime format: last_record_time = "2021-06-01T00:00:00.000Z"

page_number = 0
page_size = 100

query = (
    f"filterStart={last_record_time}&sortOrder=asc&"
    f"sortField=record_time&pageSize={page_size}&pageNumber={page_number}"
)

# Get first page and add to DataFrame
data = requests.get(f"{url}?{query}", headers=headers)

count = data.json()["count"]  # Total count of samples newer than last record time
page_df = pd.DataFrame(data.json()["samples"])  # Make DataFrame of downloaded page
df = pd.concat([df, page_df], ignore_index=True)  # Add DataFrame page to DataFrame

tot_pages = -(-count // page_size)  # Ceiling division avoids an extra empty page
# Pagination loop to download the rest of the pages
for page_number in range(1, tot_pages):
    query = (
        f"filterStart={last_record_time}&sortOrder=asc&"
        f"sortField=record_time&pageSize={page_size}&pageNumber={page_number}"
    )
    data = requests.get(f"{url}?{query}", headers=headers)
    page_df = pd.DataFrame(data.json()["samples"])  # Make DataFrame of downloaded page
    df = pd.concat([df, page_df], ignore_index=True)  # Append page to DataFrame
    print(f"Downloaded page {page_number + 1}/{tot_pages}")

# Write to CSV and Parquet
df.to_csv(f"{filename}.csv")
df.to_parquet(f"{filename}.parquet")
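
The update steps above can be wrapped into a small helper for repeated on-demand runs. A minimal sketch, assuming the same url, headers, and filename variables as above; update_dataset is a hypothetical name, not part of src.utils:

def update_dataset(url, headers, filename, page_size=100):
    """Hypothetical helper: append all samples newer than those on disk."""
    df = pd.read_csv(f"{filename}.csv", index_col=0)
    df = df.sort_values(by="record_time")
    last_record_time = df.iloc[-1]["record_time"]
    base = f"filterStart={last_record_time}&sortOrder=asc&sortField=record_time"
    # The first page also returns the total count of newer samples
    data = requests.get(
        f"{url}?{base}&pageSize={page_size}&pageNumber=0", headers=headers
    ).json()
    df = pd.concat([df, pd.DataFrame(data["samples"])], ignore_index=True)
    tot_pages = -(-data["count"] // page_size)  # Ceiling division
    for page_number in range(1, tot_pages):
        data = requests.get(
            f"{url}?{base}&pageSize={page_size}&pageNumber={page_number}",
            headers=headers,
        ).json()
        df = pd.concat([df, pd.DataFrame(data["samples"])], ignore_index=True)
    df.to_csv(f"{filename}.csv")
    return df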

1.2 Get dataset with Senscape library

Another option is to use the senscape package, which was built to feed the Mosquito Alert database table tigapublic_irideon. Its advantage over the script above is that it sends multi-threaded requests (faster download) and can be adapted to store data directly in a PostgreSQL database (a sketch follows at the end of this section). On the other hand, it requires more dependencies to install.

# Set the request range for the whole dataset
# startDate='2021-06-01T00:00:00.000Z', # example dateTime format
startDate = ""
count = senscape.getCount(headers, startDate=startDate)

# Alternative startDate if the first attempt returns nothing because of the history data limit
if count == 0:
    startDate = "2020-08-05T00:00:00.000Z"
    count = senscape.getCount(headers, startDate=startDate)

urls = senscape.getUrls(count, pageSize=1000, startDate=startDate)

# Get the data. Adjust the timeout if needed, since the API's response can be slow
df = senscape.requestExecutor(
    urls, headers, timeout=None, workers=4, set_sort_index="record_time", df_query=""
)
df.info()
# Save the dataset as CSV and Parquet
filename = f"{path}/dataset"
df.to_csv(f"{filename}.csv")
df.to_parquet(f"{filename}.parquet")
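
As noted above, the senscape workflow can also feed a PostgreSQL database instead of (or in addition to) files on disk. A minimal sketch using pandas.DataFrame.to_sql with SQLAlchemy; the connection string is a placeholder, and tigapublic_irideon is the target table mentioned above:

from sqlalchemy import create_engine

# Placeholder connection details; replace with your own credentials
engine = create_engine("postgresql://user:password@localhost:5432/mosquito_alert")

# Append the samples; index=True keeps the record_time index
# (set via set_sort_index above) as a column in the table
df.to_sql(
    "tigapublic_irideon", engine, if_exists="append",
    index=True, index_label="record_time",
)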