Dataset: senscape_traps
import requests
import pandas as pd
import src.utils as ut
from src import senscape
# Set up the root path of the application
project_path = ut.project_path()
# Load the metadata
meta_filename = [
    f"{ut.project_path(1)}/meta/mosquito_alert/irideon_senscape_traps.json",
    f"{ut.project_path(2)}/meta_ipynb/irideon_senscape_traps.html",
]
metadata = ut.load_metadata(meta_filename)
# Display the metadata info
ut.info_meta(metadata)
1. Distribution by API
Before requesting data from the irideon_senscape_traps dataset, you need an API key to get access. There are two options: run a simple script that downloads the data with serial requests (see Section 1.1), or use an ad-hoc library that sends requests in parallel (see Section 1.2).
API_KEY = input("Enter Senscape API-key: ")
headers = {"Authorization": API_KEY}
# Get contentUrl, dataset name and distribution name from the metadata
contentUrl, dataset_name, distr_name = ut.get_meta(
    metadata, idx_distribution=0, idx_hasPart=None
)
# Make folders for data download
path = f"{project_path}/data/{dataset_name}/{distr_name}"
ut.makedirs(path)
1.1 Get dataset with simple script
This is a simple example script showing how to download the complete dataset and how to update the CSV file on demand. First, get just the first five samples, build a pandas DataFrame, and store it as CSV.
# Request just the first 5 records from the API
url = contentUrl
response = requests.get(
    f"{url}?sortOrder=asc&sortField=record_time&pageSize=5", headers=headers
)
df = pd.DataFrame(response.json()["samples"])
# Save the samples to CSV
filename = f"{path}/dataset"
df.to_csv(f"{filename}.csv")
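The requests in this notebook assume the API always responds successfully. A more defensive variant of the call above (a sketch, not part of the original script) sets a timeout and raises on HTTP error codes before parsing the JSON:

# Sketch: defensive variant of the request above
response = requests.get(
    f"{url}?sortOrder=asc&sortField=record_time&pageSize=5",
    headers=headers,
    timeout=30,  # seconds; fail instead of hanging on a slow API
)
response.raise_for_status()  # raise an HTTPError on 4xx/5xx responses
df = pd.DataFrame(response.json()["samples"])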
At this point, load the dataset from CSV, get the record time of the last sample, download all newer samples in a paginated loop, append them to the DataFrame, and store the result as CSV again.
# Read CSV
df = pd.read_csv(f"{filename}.csv", index_col=0)
df.head()
# Sort ascending by record time
df = df.sort_values(by="record_time")
# Get record time of the last sample
last_record_time = df.iloc[-1]["record_time"]
# e.g. last_record_time = '2021-06-01T00:00:00.000Z'
page_number = 0
page_size = 100
query = (
    f"filterStart={last_record_time}&sortOrder=asc&"
    f"sortField=record_time&pageSize={page_size}&pageNumber={page_number}"
)
# Get first page and add to DataFrame
data = requests.get(f"{url}?{query}", headers=headers)
count = data.json()["count"] # Total count of samples newer than last record time
page_df = pd.DataFrame(data.json()["samples"]) # Make DataFrame of downloaded page
df = pd.concat([df, page_df], ignore_index=True)  # Append the page to the DataFrame
tot_pages = (count + page_size - 1) // page_size  # Total pages needed (ceiling division)
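As an aside, the query strings in this notebook are concatenated by hand. An equivalent construction with urllib.parse.urlencode (a sketch, not part of the original script) takes care of percent-encoding, e.g. the colons in the timestamp:

from urllib.parse import urlencode

# Sketch: the same parameters as above, built with urlencode
params = {
    "filterStart": last_record_time,
    "sortOrder": "asc",
    "sortField": "record_time",
    "pageSize": page_size,
    "pageNumber": page_number,
}
encoded_query = urlencode(params)  # drop-in for the manual f-string, assuming the API accepts encoded values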
# Pagination loop to download the rest of the pages
for page_number in range(1, tot_pages):
    data = requests.get(
        f"{url}?filterStart={last_record_time}&sortOrder=asc&"
        f"sortField=record_time&pageSize={page_size}&pageNumber={page_number}",
        headers=headers,
    )
    page_df = pd.DataFrame(data.json()["samples"])  # Make DataFrame of downloaded page
    df = pd.concat([df, page_df], ignore_index=True)  # Append the page to the DataFrame
    print(f"Pages downloaded: {page_number + 1}/{tot_pages}")
# Write the updated dataset to CSV and Parquet
df.to_csv(f"{filename}.csv")
df.to_parquet(f"{filename}.parquet")
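Note that filterStart is based on the last stored record time, so the first downloaded page can repeat the sample(s) already in the CSV. If that matters, duplicates can be dropped before the save step above; the sketch assumes re-downloaded samples are exact row copies, since the API's unique key is not documented here.

# Sketch: remove rows that were re-downloaded because filterStart overlaps
# with the last stored record (assumes duplicates are exact row copies)
df = df.drop_duplicates(ignore_index=True)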
1.2 Get dataset with Senscape library
Another option is to use the senscape package, which was written to feed the Mosquito Alert database table tigapublic_irideon. Its advantage over the script above is that it sends requests from multiple threads (shorter download time) and can be adapted to store data directly in a PostgreSQL database. On the other hand, it requires more dependencies to install.
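The package internals are not shown in this notebook, but the parallel-download pattern it relies on can be sketched with the standard library alone. The helper below is an illustration of the idea, not senscape's actual implementation; fetch_page and fetch_all are hypothetical names.

from concurrent.futures import ThreadPoolExecutor


def fetch_page(page_url, headers, timeout=60):
    # Sketch: download a single page and return its samples as a DataFrame
    response = requests.get(page_url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return pd.DataFrame(response.json()["samples"])


def fetch_all(page_urls, headers, workers=4):
    # Sketch: request all pages in parallel and concatenate the results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(lambda u: fetch_page(u, headers), page_urls))
    return pd.concat(frames, ignore_index=True)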
# Set the request range for the complete dataset
# Example dateTime format: startDate = '2021-06-01T00:00:00.000Z'
startDate = ""
count = senscape.getCount(headers, startDate=startDate)
# Alternative startDate in case the first attempt returns nothing because of the history data limit
if count == 0:
    startDate = "2020-08-05T00:00:00.000Z"
    count = senscape.getCount(headers, startDate=startDate)
urls = senscape.getUrls(count, pageSize=1000, startDate=startDate)
# Get the data. Adjust timeout if needed since API's response could be slow
df = senscape.requestExecutor(
    urls, headers, timeout=None, workers=4, set_sort_index="record_time", df_query=""
)
df.info()
# Save the dataset to CSV and Parquet
filename = f"{path}/dataset"
df.to_csv(f"{filename}.csv")
df.to_parquet(f"{filename}.parquet")
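As an optional sanity check (not part of the original notebook), the Parquet file can be read back and its size compared with the count reported by the API:

# Sketch: round-trip the Parquet file and compare with the API's count
df_check = pd.read_parquet(f"{filename}.parquet")
print(f"Rows on disk: {len(df_check)}, count reported by API: {count}")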