• Synthetic Dataset » NEUROChem Scenarios

    Neurochem Sensor Model v0.4.3
    Last update: 28 March 2011

    Download the zipped CSV files with the raw sensor data from http://neurochem.sisbio.recerca.upc.edu/data/scenarios/2011-March/. Access is currently restricted to the NEUROChem partners, and the datasets are protected by login and password.

    1. Virtual Sensor Array
    2. CSV format
    3. Scenarios
    3.1. Classification scenario
    3.2. Segmentation scenario
    3.3. Sensor Damage scenario
    4. References

    Section 1. Virtual Sensor Array

    Design goals

    - Inspired by a real polymer sensor array dataset, UNIMAN [2], from the University of
    Manchester (Dr. Persaud)
    – Replicate the sensitivity structure of a real sensor array
    - Arbitrary number of sensors
    - Arbitrary concentration profiles in input
    - Released under Open Source License (expected mid 2011) [1]

    Developed Models

    - linear PLSr model for the steady-state signal from the UNIMAN database
    (described in Deliverable 8.3)
    – non-linear spline-based extension is available (but not used)
    - non-linear Langmuir-based model to simulate competitive sorption between
    polymer and analyte molecules on the sensor surface
    - simple time dynamics model
    - noise models
    – drift model (additive noise, multi-component)
    – sensor aging model (multiplicative noise)
    – simple concentration noise model (from gas camera)


    - three gases A, B and C (pure analytes or mixtures)
    - normalized analyte concentration from 0 to 1 (from 0% to 100%)
    – e.g. “A” is the single gas A and “A33C67” is a mixture of 33% A and 67% C.
    - the number of sensors: any number
    – the default number of sensors: 1020
    - the number of features per sensor: temporal dynamics (length of 120 s) with
    the steady-state response at 60 s
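
    The label convention above can be decoded mechanically. The following is a
    minimal Python sketch and not part of the sensor model itself: the function
    name parse_analyte is hypothetical, and treating the label “Air” as zero
    concentration of all three gases is an assumption based on the CSV example
    in Section 2.

```python
import re

GASES = ("A", "B", "C")

def parse_analyte(label):
    """Convert a label such as "A" or "A33C67" into normalized concentrations."""
    conc = {gas: 0.0 for gas in GASES}
    if label == "Air":
        # Assumption: "Air" means that none of the three gases is present.
        return conc
    parts = re.findall(r"([ABC])(\d*)", label)
    if not parts or "".join(g + p for g, p in parts) != label:
        raise ValueError(f"unrecognized analyte label: {label!r}")
    for gas, percent in parts:
        # A bare letter (no digits) denotes the pure gas, i.e. 100%.
        conc[gas] = int(percent) / 100 if percent else 1.0
    return conc

pure_a = parse_analyte("A")
mixture = parse_analyte("A33C67")
```

    Here pure_a holds a concentration of 1.0 for gas A, and mixture holds 0.33
    for gas A and 0.67 for gas C on the normalized 0 to 1 scale.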

    Reference UNIMAN database

    The dataset was obtained at the facilities of the University of Manchester. Three
    gases at different concentration levels were measured: ammonia (0.01%, 0.02%,
    0.05%), propanoic acid (0.01%, 0.02%, 0.05%) and n-butanol (0.01%, 0.1%). The
    experiments were repeated on a regular basis over 7 months. The sensor array
    was composed of 17 polymeric sensors. A total of 3925 samples were acquired and
    labeled with the corresponding gases and concentrations. Each sensor response
    is 329 s long, sampled at 1 Hz. The compound is introduced to the
    sensor array at instant t=0s, then the clean air enters the chamber at instant

    For feature extraction, the sensor response at instant t=180s is used, thus
    forming a 17-dimensional feature space from the array of 17 sensors. Outliers
    were removed with the algorithm of Filzmoser et al. using the default
    parameters, which reduced the number of samples from 3925 to 3484 [2].
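
    The feature extraction step amounts to picking one time point per sensor.
    The sketch below illustrates this under stated assumptions: the response is
    held as a list of per-second rows with one value per sensor, the first row
    corresponds to t=0s, and the function name extract_features is hypothetical.
    The outlier removal step is not reproduced here.

```python
def extract_features(response, t_feature=180, sampling_hz=1, t_start=0):
    """Return the readings of all sensors at time t_feature (in seconds).

    response: list of rows, one per time sample, each with one value per sensor.
    """
    index = int(round((t_feature - t_start) * sampling_hz))
    if not 0 <= index < len(response):
        raise IndexError("t_feature lies outside the recorded response")
    return list(response[index])

# Toy response: 329 time samples x 17 sensors; each value encodes (time, sensor).
response = [[t + s / 100 for s in range(17)] for t in range(329)]
features = extract_features(response)
```

    The result is one 17-dimensional feature vector per measured sample.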

    Section 2. CSV format

    The dataset format depends directly on the scenario that the data represents.
    Since a scenario consists of training and validation phases, the dataset is
    divided into the groups “TrainingSet”, “ValidationSet” and “InterimSet” (the
    latter holds the intermediate samples between training and validation).

    CSV file format

    The fields are delimited by the “,” symbol. The approximate file size for
    the data from 1020 sensors is 100 MB (50 MB when zip-compressed).

    CSV columns

    - s1, s2, … s1020: Columns with names starting with “s” hold the data from a
    sensor.
    - Gas: The label of the analyte the array is exposed to.
    - Set: The label of the training/validation set of the scenario.
    - cA, cB and cC: The concentrations of gas components A, B and C in the analyte.
    - time: The time passed in seconds.

    Example of CSV file (classification scenario)

    "s1",   "s2",   ...  "s1020", "Gas",   "Set",       "cA", "cB", "cC", "time"
    8.21872,8.20375,  ... 7.806619, "Air", "TrainingSet",   0,    0,   0,    1
    8.218893,8.20589, ... 7.794384, "Air", "TrainingSet",   0,    0,   0,    2

    The first line holds the column names. The second and third lines are data.
    These two samples correspond to air reaching the array during the first 2 s of
    the training phase. None of the three gases A, B or C is present, so their
    concentrations (the “cA”, “cB” and “cC” fields) are zero.
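
    A file in this format can be read with any CSV parser. The following sketch
    uses only the Python standard library and an in-memory copy of the example
    above, reduced to the three sensor columns shown there. The function name
    read_scenario and the layout of the returned records are illustrative
    assumptions, not a prescribed loader.

```python
import csv
import io

# In-memory stand-in for a scenario file, reduced to three sensor columns.
SAMPLE = '''"s1","s2","s1020","Gas","Set","cA","cB","cC","time"
8.21872,8.20375,7.806619,"Air","TrainingSet",0,0,0,1
8.218893,8.20589,7.794384,"Air","TrainingSet",0,0,0,2
'''

def read_scenario(handle):
    """Split each row into sensor readings and metadata fields."""
    # skipinitialspace tolerates blanks after the "," delimiter.
    reader = csv.DictReader(handle, skipinitialspace=True)
    # Sensor columns are named "s" followed by the sensor number.
    sensor_cols = [c for c in reader.fieldnames
                   if c[:1] == "s" and c[1:].isdigit()]
    rows = []
    for rec in reader:
        rows.append({
            "signal": [float(rec[c]) for c in sensor_cols],
            "gas": rec["Gas"],
            "set": rec["Set"],
            "conc": tuple(float(rec[k]) for k in ("cA", "cB", "cC")),
            "time": float(rec["time"]),
        })
    return sensor_cols, rows

sensor_cols, rows = read_scenario(io.StringIO(SAMPLE))
```

    For a real file, the StringIO object would be replaced by a file handle
    opened on the unzipped CSV.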

    Section 3. Scenarios

    Reduced list of scenarios

    The final list of scenarios to be demonstrated at the end of the NEUROChem
    project [3]:
    - Classification
    - Segmentation
    - Sensor damage
    - Context dependent behavior

    Scenario parameters

    - nT: the number of samples per analyte in the training set
    – the default nT value: 30
    - nV: the number of samples per analyte in the validation set
    – the default nV value: 30
    - difficulty: the difficulty of the scenario
    (defined separately for each scenario)

    Each sample consists of 60 s of analyte exposure to the array followed by
    60 s of a cleaning phase with air. The total sample length is 120 s (with
    the steady-state at 60 s).
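
    The structure of a single sample can be written down as a concentration
    profile. The sketch below assumes one value per second and a time axis that
    starts at 1 s within the sample. Both are assumptions made for illustration,
    and the function name sample_profile is hypothetical.

```python
def sample_profile(conc, exposure_s=60, cleaning_s=60):
    """Per-second concentration profile of one sample.

    conc: (cA, cB, cC) during the exposure phase; the cleaning phase is air.
    Returns a list of (time, cA, cB, cC) tuples.
    """
    air = (0.0, 0.0, 0.0)
    profile = []
    for t in range(1, exposure_s + cleaning_s + 1):
        current = tuple(conc) if t <= exposure_s else air
        profile.append((t, *current))
    return profile

# One sample of the mixture A33C67: 60 s of exposure, then 60 s of air.
profile = sample_profile((0.33, 0.0, 0.67))
```

    The last exposure row is at 60 s, where the steady-state response is read,
    and all rows from 61 s to 120 s carry zero concentrations.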

    Section 3.1. Classification scenario

    Definition of difficulty

    The difficulty of classification is defined by how similar the analytes are.
    The two analytes were selected as mixtures of gases A and C.
    - difficulty 1: A and C
    - difficulty 2: A17C83 and A83C17
    - difficulty 3: A33C67 and A67C33
    - difficulty 4: A40C60 and A60C40
    - difficulty 5: A45C55 and A55C45
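
    How close the two analytes are at each level can be made explicit with a
    short calculation. The sketch below uses the difference in the gas A
    fraction as a similarity measure. This measure is chosen here for
    illustration only and is not stated in the scenario definition.

```python
# Fraction of gas A in each analyte of the pair; the remainder is gas C.
PAIRS = {
    1: (1.00, 0.00),  # A      vs C
    2: (0.17, 0.83),  # A17C83 vs A83C17
    3: (0.33, 0.67),  # A33C67 vs A67C33
    4: (0.40, 0.60),  # A40C60 vs A60C40
    5: (0.45, 0.55),  # A45C55 vs A55C45
}

def separation(pair):
    """Absolute difference in the gas A fraction between the two analytes."""
    return round(abs(pair[0] - pair[1]), 2)

separations = {level: separation(pair) for level, pair in PAIRS.items()}
```

    The separation shrinks from 1.0 at difficulty 1 to 0.1 at difficulty 5,
    which matches the intent that higher levels use more similar analytes.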


    - classification-sdata-1.csv.zip
    - classification-sdata-2.csv.zip
    - classification-sdata-3.csv.zip
    - classification-sdata-4.csv.zip
    - classification-sdata-5.csv.zip

    Preliminary tests

    Performance of KNN classifier (k=3) on the steady-state signal (baseline

               knn.perf   file
    Exp. 1     1.00       sdata-classification-1.csv
    Exp. 2     0.98       sdata-classification-2.csv
    Exp. 3     1.00       sdata-classification-3.csv
    Exp. 4     0.92       sdata-classification-4.csv
    Exp. 5     0.67       sdata-classification-5.csv


    The first three classification scenarios (difficulty levels 1, 2 and 3) show
    similar KNN performance, with a classification rate close to 100%. The last
    two scenarios (difficulty levels 4 and 5) show considerably lower performance,
    0.92 and 0.67 respectively.
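
    For readers who want to reproduce a comparable baseline, a k-nearest-neighbour
    classifier with k=3 can be sketched in a few lines. The code below is a
    generic KNN with Euclidean distance and majority vote, not the script that
    produced the figures above. The toy feature vectors are hypothetical and
    stand in for steady-state sensor readings.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Label of `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Toy two-sensor steady-state features for two analytes (hypothetical values).
train_x = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
           (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
train_y = ["A", "A", "A", "C", "C", "C"]
test_x = [(1.1, 1.0), (3.1, 3.0)]
test_y = ["A", "C"]
score = accuracy(train_x, train_y, test_x, test_y)
```

    With real data, train_x and test_x would hold the sensor readings at the
    steady-state time point of the TrainingSet and ValidationSet samples.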

    Section 3.2. Segmentation scenario

    Definition of difficulty

    As in the classification scenario, the parameter that controls the
    difficulty of the segmentation task is the similarity between the odours
    to be segmented. The closer the odours, the more difficult they are to segment.
    - difficulty 1: A and C (Training) and A50C50 (Validation)
    - difficulty 2: A and C (Training) and A45C55 (Validation)
    - difficulty 3: A and C (Training) and A60C40 (Validation)
    - difficulty 4: A and C (Training) and A67C33 (Validation)
    - difficulty 5: A and C (Training) and A83C17 (Validation)


    The synthetic sensors have a higher affinity for gas A than for gas C, which is
    simulated with the sorption model for analyte mixtures. Hence, an increase in
    the difficulty level corresponds to an increase in the proportion of gas A in
    the mixture, which induces stronger suppression of gas C by gas A (in the
    validation phase of the segmentation scenario).
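
    The suppression effect can be illustrated with the textbook competitive
    Langmuir isotherm, in which the coverage of gas i is
    K_i * c_i / (1 + sum_j K_j * c_j). The sketch below is not the sensor model
    itself, whose exact formulation is not given here. The affinity constants
    are hypothetical and chosen only so that gas A binds more strongly than gas C.

```python
def langmuir_coverage(conc, affinity):
    """Fractional surface coverage per gas under competitive Langmuir sorption."""
    denom = 1.0 + sum(affinity[g] * c for g, c in conc.items())
    return {g: affinity[g] * c / denom for g, c in conc.items()}

# Hypothetical affinity constants: gas A binds more strongly than gas C.
K = {"A": 10.0, "C": 2.0}

def relative_c_signal(frac_a, frac_c):
    """Coverage of gas C in the mixture relative to gas C alone at the same level."""
    mixed = langmuir_coverage({"A": frac_a, "C": frac_c}, K)["C"]
    alone = langmuir_coverage({"C": frac_c}, K)["C"]
    return mixed / alone

# Validation mixtures of the segmentation scenario, sorted by share of gas A.
mixtures = [(0.45, 0.55), (0.50, 0.50), (0.60, 0.40), (0.67, 0.33), (0.83, 0.17)]
ratios = [relative_c_signal(a, c) for a, c in mixtures]
```

    Under these assumptions the relative contribution of gas C falls steadily
    as the share of gas A grows, which is the suppression described above.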

    Section 3.3. Sensor Damage scenario

    This scenario is also called Sensor replacement I.

    The dataset from the Classification scenario of difficulty 3 (file:
    sdata-classification-3.csv) will be re-used. A parametrized proportion of the
    sensors will be removed in the validation set. The signal of each removed
    sensor will be set to a baseline level with small normal noise (the noise is
    needed only to visualize the data with PCA).

    The sensors to be removed are selected randomly, always with the same random seed.
    Hence, the list of removed sensors for the scenario with difficulty 1 is contained in
    the lists for the scenarios with higher levels of difficulty, e.g. 2 or 3.

    Definition of difficulty

    The total percentage of sensors removed (damaged) will be used in this scenario
    as the difficulty parameter.
    - difficulty 1: 6.25%
    - difficulty 2: 12.5%
    - difficulty 3: 18.75%
    - difficulty 4: 25%
    - difficulty 5: 31.25%
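
    One way to obtain nested lists of removed sensors is to draw a single
    seeded permutation and take a longer prefix for each difficulty level. The
    sketch below follows that idea. The seed values, the truncation of
    fractional sensor counts, the baseline level and the noise width are all
    assumptions made for illustration, and the function names are hypothetical.

```python
import random

def damaged_sensors(n_sensors, fraction, seed=0):
    """Indices of removed sensors: a prefix of one seeded permutation."""
    order = list(range(n_sensors))
    random.Random(seed).shuffle(order)
    # Assumption: fractional counts are truncated to an integer.
    return order[:int(fraction * n_sensors)]

def apply_damage(signal, damaged, baseline=0.0, noise_sd=0.01, seed=1):
    """Replace damaged channels by a baseline level plus small normal noise."""
    rng = random.Random(seed)
    out = list(signal)
    for i in damaged:
        out[i] = baseline + rng.gauss(0.0, noise_sd)
    return out

# Fraction of removed sensors per difficulty level.
FRACTIONS = {1: 0.0625, 2: 0.125, 3: 0.1875, 4: 0.25, 5: 0.3125}
removed = {d: damaged_sensors(1020, f) for d, f in FRACTIONS.items()}
damaged_row = apply_damage([8.0] * 1020, removed[1])
```

    Because every level uses the same permutation, the sensors removed at one
    level are always a subset of those removed at the next.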

    Preliminary tests

    Performance of KNN classifier (k=3) on the steady-state signal (baseline

               knn.perf   file
    Exp. 1     0.98       sdata-sensor-damage-1.csv
    Exp. 2     0.87       sdata-sensor-damage-2.csv
    Exp. 3     0.50       sdata-sensor-damage-3.csv
    Exp. 4     0.50       sdata-sensor-damage-4.csv
    Exp. 5     0.50       sdata-sensor-damage-5.csv

    Section 4. References

    [1] Web page of NEUROChem Scenarios, http://neurochem.sisbio.recerca.upc.edu/?page_id=222

    [2] Drift compensation of gas sensor array data by common principal component
    analysis, A. Ziyatdinov, S. Marco, A. Chaudry, K. Persaud, P. Caminal and
    A. Perera, Sensors and Actuators B: Chemical, 2010, http://dx.doi.org/10.1016/j.snb.2009.11.034

    [3] A Large Scale Virtual Array of Non-selective Odour Sensors for the Neurochem Project, A. Ziyatdinov, K. Persaud, S. Marco and A. Perera, Neurochem Workshop, March 2011, http://neurochem.sisbio.recerca.upc.edu/public/talks/NeurochemWorkshopTalkUPC.pdf