Considering the increasing pollution levels in the city and their harmful effects on children's health, an event
management firm has decided to conduct outdoor events only when carbon monoxide (CO) levels are within
3 ppm to 9 ppm. For this, they need a model that predicts the expected daily maximum CO level
one week in advance.
The goal is to forecast the daily maximum CO level for the next week (5th April 2005 to 11th
April 2005) using data on various air pollutants, including CO, recorded from 10th March 2004 to 4th April 2005.
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide
chemical sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from
10th March 2004 to 4th April 2005 (about one year). Ground truth hourly averaged concentrations for CO, Non-Metallic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were
provided by a co-located certified reference analyzer.
Source: UCI Machine Learning Repository, Air Quality data set
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non Metallic Hydro Carbons concentration in micro g/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
Missing values were encoded in the data as “-200”. The data showed monthly seasonality and
also varied by day of the week, which could be because of the varying number of
automobiles (emitting air pollutants) on weekdays and weekends.
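The handling of the “-200” sentinel and the reduction from hourly readings to a daily maximum can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name daily_max, the choice to drop missing readings rather than impute them, and the sample values are all assumptions.

```python
from collections import defaultdict

MISSING = -200  # sentinel used in the dataset for a missing reading


def daily_max(rows):
    """Reduce hourly (date, value) readings to one maximum per day.

    Missing readings are skipped (an assumption; the report does not say
    how they were replaced). A day with no valid reading maps to None.
    """
    by_day = defaultdict(list)
    for day, value in rows:
        readings = by_day[day]  # creates the key even if all readings are missing
        if value != MISSING:
            readings.append(value)
    return {day: (max(vals) if vals else None) for day, vals in by_day.items()}


# Made-up hourly CO readings in mg/m^3, for illustration only
hourly = [
    ("10/03/2004", 2.6), ("10/03/2004", -200), ("10/03/2004", 2.2),
    ("11/03/2004", -200),
]
daily = daily_max(hourly)
```

Days that come back as None would still need to be filled or excluded before modelling.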
[Figure: curves of the output variable (CO), after replacing missing values]
Input Variables: After checking the available variables, we decided that the following can
affect CO levels:
• Daily maximum C6H6 (lag 8)
• Daily maximum T (lag 7)
• Daily maximum AH (lag 7)
• Monthly dummy variables
• Weekly dummy variables
Output Variable: Daily maximum CO concentration
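One way to assemble a row of these inputs for a given forecast day is sketched below. The function name build_row, the dictionary layout, and the choice of baseline categories (January, Monday) are assumptions made for illustration. "Weekly dummy variables" is read here as day-of-week indicators, which the report does not confirm.

```python
from datetime import date, timedelta


def build_row(day, daily_c6h6, daily_t, daily_ah):
    """Build the regression inputs for forecasting CO on `day`.

    Each daily_* argument maps a date to that day's daily maximum.
    Lags follow the variable list: C6H6 at lag 8, T and AH at lag 7.
    """
    row = {
        "c6h6_lag8": daily_c6h6[day - timedelta(days=8)],
        "t_lag7": daily_t[day - timedelta(days=7)],
        "ah_lag7": daily_ah[day - timedelta(days=7)],
    }
    # Month dummies; January is the all-zero baseline (assumption)
    for m in range(2, 13):
        row[f"month_{m}"] = 1 if day.month == m else 0
    # Day-of-week dummies; Monday (weekday() == 0) is the baseline (assumption)
    for d in range(1, 7):
        row[f"dow_{d}"] = 1 if day.weekday() == d else 0
    return row


# Made-up daily maxima, for illustration only
start = date(2005, 3, 20)
c6h6 = {start + timedelta(days=i): 10.0 + i for i in range(20)}
temp = {start + timedelta(days=i): 15.0 + i for i in range(20)}
ah = {start + timedelta(days=i): 1.0 for i in range(20)}

row = build_row(date(2005, 4, 5), c6h6, temp, ah)
```

Because every lag is at least 7 days, all inputs for a day up to one week ahead are already observed when the forecast is made.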
TRAINING DATA = 11 months data set; VALIDATION DATA = 1 month data set
Final Forecasting Method
“Multi-Linear Regression Model”
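• The multi-linear regression model (best-fit model) gave a Root Mean Square Error (RMSE) of 1.2 mg/m^3,
which is about 37% lower than the naïve prediction RMSE of 1.89.
• RMSE of the best-fit model in PPM = (1.2 * 24.45)/28 ≈ 1.05 PPM, taken as 1.1 PPM in the risk estimate (24.45 L/mol is the molar volume of a gas at 25 °C and 1 atm; 28 g/mol is the molar mass of CO)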
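The two calculations in the bullets above, the RMSE and the conversion from mg/m^3 to PPM, can be sketched as follows. The function and constant names are illustrative assumptions. The conversion uses the same factors as the text: 24.45 L/mol for the molar volume of a gas at 25 °C and 1 atm, and 28 g/mol for CO.

```python
import math

MOLAR_VOLUME_L_PER_MOL = 24.45  # ideal gas at 25 °C and 1 atm
CO_MOLAR_MASS_G_PER_MOL = 28.0  # rounded as in the text (28.01 more precisely)


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mg_per_m3_to_ppm(value):
    """Convert a CO concentration (or an error in the same units) to PPM."""
    return value * MOLAR_VOLUME_L_PER_MOL / CO_MOLAR_MASS_G_PER_MOL


# RMSE of the best-fit model reported in the text, converted to PPM
rmse_ppm = mg_per_m3_to_ppm(1.2)
```

Since the conversion is a pure scaling, it can be applied to the RMSE directly instead of converting every prediction first.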
Taking the allowable CO range as 3 ppm to 9 ppm and assuming a Normal distribution centred on the middle of that range
(mean µ = 6) with standard deviation σ = 1.1 (the RMSE in PPM), the probability of exceeding the upper limit is
P(X > 9) = P(Z > (9 - µ)/σ) = P(Z > (9 - 6)/1.1) = 1 - 0.9968 ≈ 0.003. Therefore, the risk associated with the forecast is only about 0.3%. Thus, the multi-linear
regression model can be used to predict the daily maximum CO level for the next week.
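The tail probability above can be checked numerically with the standard library. This is a small sketch under the same Normal assumption as the text; the function name normal_tail is an illustrative choice.

```python
import math


def normal_tail(x, mean, sd):
    """P(X > x) for X ~ Normal(mean, sd), via the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Probability that the CO level exceeds 9 PPM when the mean is 6 PPM
# and the standard deviation is 1.1 PPM
p_exceed = normal_tail(9.0, mean=6.0, sd=1.1)  # about 0.003
```

The same function gives the risk for any other mean, for example a forecast closer to the upper limit.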
To be on the safer side, if the outdoor CO level is above 8 PPM (about one standard deviation below
the 9 PPM threshold), it is advisable for the event management firm not to conduct the outdoor event.
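The resulting rule can be written as a simple check. This is a sketch only: the function name can_hold_event and its parameter names are assumptions, and the 1 PPM margin is the roughly one-standard-deviation buffer described above.

```python
def can_hold_event(co_ppm, lower=3.0, upper=9.0, margin=1.0):
    """Return True if the CO level lies in the allowed range, keeping a
    safety margin below the upper limit (9 - 1 = 8 PPM by default)."""
    return lower <= co_ppm <= upper - margin


decisions = {level: can_hold_event(level) for level in (2.0, 6.0, 8.0, 8.5)}
```

With the defaults, a level of 6 or 8 PPM passes the check, while 2 PPM and 8.5 PPM do not.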