We create a mapping based on some reference data referring to a small geographic area we can’t resolve with our data.

*explanation on capture rate*

We don’t know a priori how each cell contributes to the measurement of visitors and we create an iterative process to estabilsh the contribution of each cell.

Reference data have internal consistency that varies from location to location.

We first control how stable is the correlation adding noise.

*stability of correlation introducing noise*

We can show as well that noise at the hourly level does not change correlation as fast as for daily values.

*noise on reference data*

We see the effect of noise on reference data

*noise on reference data*

Already with 15% noise we can’t match the reference data with correlation 0.6 at a constant relative error.

If we look at historical data (variance of days with the same isocalendar date) we see that holiday have a big contribution in deviation.

*deviation on isocalendar*

We quantify the forcastability of a customer time series running a long short term memory on reference data.

We can see that some reference data are easy forcastable by the model

*example of good forecastable model*

While some are not understood from the neural network

*example of bad forecastable model*

We use the following job file to produce the activity with the filters:

*filter on activities*

- max 2 hours stay
- 20 km previous trip distance
- chirality (previous direction)

Filtering does not change the overall curves but just the niveau

Filtering options:

name | tripEx | distance | duration | chirality |
---|---|---|---|---|

t4 | 10.5 | 20km | 30min-2h | True |

t4_10 | 10.5 | 10km | 15min-2h | True |

t4_p11 | 11.6 | 20km | 15min-2h | True |

t4_p11_d20-notime | 11.6 | 20km | 15min-2h | True |

t4_p11_d10 | 11.6 | 10km | 15min-2h | True |

t4_p11_d20 | 11.6 | 20km | 15min-2h | True |

t4_p11_d30 | 11.6 | 30km | 15min-2h | True |

t4_p11_d40 | 11.6 | 40km | 15min-2h | True |

*effect of filters on curves*

*counts changed by filtering*

We calculate daily values on cilac basis.

The tarball is downloaded and processed with an etl script which uses the a function to unpack the tar and process the output with spark.

We filter the cells using only the first 20 ones whose centroid is close to the poi.

*some cells are not correlating within each other*

At first sight the sum of activities doesn’t mirror the reference data, neither in sum nor in correlation.

*sum of overall activities and visits*

We use linLeastSq to find the best linear weights for each cell contributing to measure the activities at the location.

```
def ser_sin(x,t,param): #weights times activities
return x*t.sum(axis=0)
def ser_fun_min(x,t,y,param): #minimizing total sum
return ser_sin(x,t,param).sum() - y.sum()
x0 = X.sum(axis=0) #starting values
x0 = x0/x0.mean() #normalization
res = least_squares(ser_fun_min,x0,args=(X,y,x0)) #least square optimization
beta_hat = res['x']
```

*function that optimizes cells weights minimizing the total sum difference wrt reference data*

We iterate over all locations to find the best weight set using a minimum number of cells (i.e. 5)

*linear regression on cells*

We iterate over the different filters and we see a slight improvement in input data which is, on the other hand, irrelevant after weighting.

*effect of filtering on correlation pre and post mapping*

Among all the different combination we prefere the version 11.6.1, 40 km previous distance, chirality, 0 to 2h dwelling time.

We see that chirality is stable but a third of counts doesn’t have a chirality assigned.

*daily values of chirality*

We monitor the effect of the different steps on the total correlation with reference values.

*correlation monitorin after any process step*

We use etl_dirCount to download with beautifulSoup all the processd day from the analyzer output and create a query to the postgres database.

A spark function loads into all the database outputs into a pandas dataframe, take the maximum of the incoming flow and correct the timezone.

*direction counts*

Now we check the stability of the capture rate over time and the min-max interval on monthly and daily values.

*capture rate over different locations which show a common pattern*

We see that the range of maximum and minimum capture rate is similar over different locations and we don’t have problematic outliers.

*capture rate boxplot over different locations*

Over the time different models were tested to select the one with the best performances and stability.

The suite basically consists in:

- api_temperature.py enrich location with weather information
- geo_octree.py flexible spatial grid and geometric algebric operations
- kpi_vis.py visualization of KPIs
- train_keras.py training with keras (tensorflow backend)
- learn_play.py regressor on a learn/play set split with cross validation
- bot_selenium.py bot to enrich location information with maps popularity lines
- series_lib.py signal processing libraries
- train_lib.py collections of predictors and dataset utilities
- train_modelList.py collection of models
- train_execute.py routines for data preparation and iterative learning
- geo_enrich.py enrich locations with geographical information
- kernel_lib.py collections of kernels for image processing and convolutions
- proc_text.py text parsing utilities to sort out research results
- train_shapeLib.py summarizes curve shapes into their most relevant statistical properties

We use learn_play to run a series of regressor on the time series.

Depending on the temporal resolution we use a different series of train_execute

We use kpi_viz to visualize the performances

*kpi boost*

We prepare a dataset overlapping the following information:

- reference data
- time
- activities
- footfall
- bast (isocalendar)
- historical data (isocalendar)
- mean temperature
- cloud cover
- humidity

We can see that not only all data sources are presents

*input data*

We than iterate over all locations and run a forecast over 30 days using a long short term memory algorithm train_longShort

*performance history*

We can see that the learning curve is pretty steep at the beginning and converges towards the plateau of the training set performance.

We start preparing the data for a first set of location where:

- high number of daily visits
- low chi square, skewness, variance on reference
- high correlation blind test
- low variance long short term memory
- large cell coverage

We summarized the most important metrics in this graph

*range of scores, relative values, 0 as minimum, areas represent confidence interval*

and summarized in this table.

Performances depend on the regularity of input data over time.

The cooperation with insight is required for:

*backup*in the project [short term]*validity proof*[short term]*optimization*of methods [short term]*productization*[long term]- harmonization of the
*xy source*[short term] - impact of
*new infrastructure*on activities and footfall [short term] - improve
*spatial resolution*for activities [long term] - improve
*activity filters*[long term]

TODOs:

- insert google popularity lines
- feature importance for every location
- evaluate the model on isocalendar for feb 2019
- retrain on new reference data and include footfall

After the finalization of the prediction of the reference data another prediction model has to be built to derive motorway guest counts for competitor locations.

This model will take into consideration:

- telco data
- cilac labelling
- geographical data (population density, land use)
- weather data