lernia features

Feature building from the lernia library

lernia lernia library

data sources

We collect weather data from the Dark Sky API

data_sources darksky weather api

And census data from Eurostat

data_sources eurostat census data


For each feature it is important to understand statistical properties such as:

stat_prob statistical properties of a time series
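
A minimal sketch of the kind of properties worth checking per feature, on synthetic data (the temperature-like series here is illustrative, not the real weather data):

```python
import numpy as np
from scipy import stats

# hypothetical daily temperature series (synthetic, for illustration)
rng = np.random.default_rng(0)
series = rng.normal(loc=15.0, scale=5.0, size=365)

# basic statistical properties worth checking for every feature
props = {
    "mean": float(series.mean()),
    "std": float(series.std()),
    "skewness": float(stats.skew(series)),
    "kurtosis": float(stats.kurtosis(series)),
}
print(props)
```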


We should evaluate the distribution of each variable; without normalization this view is confusing

norm_no no norm

An important operation for model convergence and performance is to normalize the data. We can then see all variances and skewness in one view

norm_minmax minmax norm
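
A minimal sketch of min-max normalization (the helper name is ours, not a lernia function):

```python
import numpy as np

def minmax_norm(x):
    """Rescale an array linearly to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # a constant feature carries no information
    return (x - x.min()) / span

print(minmax_norm([2.0, 4.0, 6.0, 10.0]))  # -> [0., 0.25, 0.5, 1.]
```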

Outliers can completely skew the distribution of variables and make learning difficult; we therefore remove the extreme percentiles

norm_5-95 norm 5-95

We remove the extreme percentiles and normalize to one

norm_1-99 norm 1-99
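
A minimal sketch of percentile clipping followed by normalization (the function name and defaults are ours):

```python
import numpy as np

def norm_clip(x, lo=1, hi=99):
    """Clip values outside the [lo, hi] percentiles, then rescale to [0, 1]."""
    x = np.asarray(x, dtype=float)
    p_lo, p_hi = np.percentile(x, [lo, hi])
    x = np.clip(x, p_lo, p_hi)
    return (x - p_lo) / (p_hi - p_lo)

# a single extreme outlier is clipped instead of dominating the scale
v = norm_clip(np.append(np.linspace(0, 1, 99), 1000.0))
```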

Correlation between features is important for excluding features that are derivable from others

feat_corr feature correlation
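
A minimal sketch of spotting a derivable feature via the correlation matrix, on synthetic weather-like columns (names and coefficients are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
temperature = rng.normal(15, 5, 200)
df = pd.DataFrame({
    "temperature": temperature,
    # a derivable feature: almost a linear function of temperature
    "apparentTemperature": temperature + rng.normal(0, 1, 200),
    "humidity": rng.uniform(0, 1, 200),
})
corr = df.corr()
# pairs with |corr| close to 1 are candidates for exclusion
print(corr.round(2))
```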

Outlier removal does not change the correlation between features

feat_corr feature correlation

Some features have too many outliers; we therefore set a threshold and transform the feature into a logistic (binary) variable

norm_cat norm cat
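
A minimal sketch of the threshold transform (the helper name and the precipitation example are ours):

```python
import numpy as np

def to_logistic(x, threshold):
    """Turn a heavy-tailed feature into a binary (logistic) variable."""
    return (np.asarray(x, dtype=float) > threshold).astype(int)

# e.g. precipitation intensity -> "rain / no rain"
flags = to_logistic([0.0, 0.1, 3.5, 0.0, 12.0], threshold=1.0)
# flags -> [0, 0, 1, 0, 1]
```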

Apart from boxplots, it is important to visualize the data density and spot multimodal distributions

norm_joyplot norm joyplot

We then sort the features and exclude highly skewed variables

norm_logistic norm logistic features

feature reduction

Looking at the 2D cross correlation, we learn a lot about the interaction between features

dimension selected features

And we can gain a preliminary understanding of how features interact

feature_2dcor feature 2d correlation

We know that apparent temperature depends on temperature, humidity, windSpeed, windBearing, and cloudCover, but we might not know how. Apparent temperature can be an important predictor, so we can reduce the other components with a PCA

pca pca on derivate feature
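
A minimal sketch of a PCA on a feature set containing a derived feature, on synthetic data (the apparent-temperature formula below is an illustrative stand-in, not the real one):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n = 300
temperature = rng.normal(15, 5, n)
humidity = rng.uniform(0, 1, n)
windSpeed = rng.gamma(2, 2, n)
# a derived feature, loosely mimicking apparent temperature
apparent = temperature - 0.7 * windSpeed + 2.0 * humidity

X = np.column_stack([temperature, humidity, windSpeed, apparent])
pca = PCA(n_components=2).fit(X)
# how much of the feature set each component explains
print(pca.explained_variance_ratio_)
```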

Interestingly, the first component explains most of the feature set but not the apparent temperature, which is described by the second component

pca components importance

For the same components we can investigate other metrics

feat_pairs feature pair metrics


replace NaNs

Working with Python doesn't leave many options: contrary to R, almost any library returns errors on NaNs. We therefore interpolate or drop rows.

replace_nan replacing nans with interpolation
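
A minimal sketch using pandas interpolation, including the boundary case:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
# linear interpolation; limit_direction="both" also fills the boundary NaNs
filled = s.interpolate(limit_direction="both")
print(filled.tolist())
```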

The main issue with interpolation is at the boundaries, where special cases should be treated

data cubes

If we have many time series per location, or multiple superimposed signals, we look at the chi-square distribution to understand which sequence windows are off

chi chi square distribution

We then replace the off windows with a neighboring cluster

replace replace volatile sequences

feature importance

Feature importance is a metric that simple models return. Models, however, don't agree on the same feature importance, and a production model may come to quite different conclusions.

featImp_norm feature importance no norm
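
A minimal sketch comparing feature importance across two model families, on synthetic data (feature roles are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
# only the first two features actually drive the target
y = 4 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)
# the two models may rank minor features differently
print(rf.feature_importances_.round(2))
print(gb.feature_importances_.round(2))
```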

Normalization stabilizes the agreement between models

featImp_norm feature importance norm

We can also apply feature regularisation, checking with a Lasso or a Ridge regression which features are relevant for the predicted variable

feat_regularisation regularisation of features, mismatch in results depending on the regressor
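
A minimal sketch of the regressor mismatch, on synthetic data: Lasso drives irrelevant coefficients to exactly zero, while Ridge only shrinks them (alphas are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
# only the first two features matter for the target
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(lasso.coef_.round(2))
print(ridge.coef_.round(2))
```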

We then iterate model training, removing one feature at a time, and calculate performances. This tells us how important each feature is for the training

feat_knock feature knock out
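
A minimal sketch of the knock-out loop, on synthetic data (the helper and feature names are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def knock_out(X, y, names):
    """Drop one feature at a time and measure the drop in cross-validated R^2."""
    model = LinearRegression()
    base = cross_val_score(model, X, y, cv=3).mean()
    return {
        name: base - cross_val_score(model, np.delete(X, i, axis=1), y, cv=3).mean()
        for i, name in enumerate(names)
    }

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=150)
drops = knock_out(X, y, ["temperature", "humidity", "pressure"])
# the biggest score drop marks the most important feature
print(drops)
```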

Strangely, removing ozone and pressure makes the rain prediction suffer. We then analyze the time series, notice a big gap in the historical data, and realize the few data points were misleading the model

feat_time feature time series


The point of building features is to achieve good predictability; if we want to predict rain we see differences in performance between models

predictability predictability, no norm

After cleaning the features, all models perform basically the same

predictability predictability, normed

The same holds if we train on spatial features with a binned prediction variable

predictability predictability, no norm

After feature cleaning we have better agreement between models

predictability predictability, normed


transform lines

Detailed information can be compressed by fitting curves

time_series simplify complexity
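
A minimal sketch of compressing a curve into a few fitted coefficients (the daily-curve example is synthetic):

```python
import numpy as np

# hypothetical daily curve, sampled hourly
t = np.linspace(0.0, 1.0, 24)
rng = np.random.default_rng(6)
y = 3.0 * t**2 - 2.0 * t + 1.0 + rng.normal(scale=0.01, size=24)

# 24 samples compressed into 3 polynomial coefficients
coeffs = np.polyfit(t, y, deg=2)
print(coeffs.round(2))  # close to [3., -2., 1.]
```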

For a multi-day time series we can distinguish periodicity from trends

time_series simplify complexity

transform dimensionality

Time series can be transformed into pictures

poly time series in pictures
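
A minimal sketch of the days-by-hours reshaping (synthetic periodic signal, for illustration):

```python
import numpy as np

# one week of hourly readings as a flat series, with a daily period
hourly = np.sin(np.arange(24 * 7) * 2 * np.pi / 24)

# reshape into a days x hours "picture": rows are days, columns are hours
image = hourly.reshape(7, 24)
print(image.shape)  # (7, 24)
```

Stacking days as rows makes the day-to-day correlation explicit: with a daily period, every row of the picture looks alike.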

This is important to induce correlation between days and to use more sophisticated methods

re../f/f_pred reference prediction

transform interpolate

We can interpolate data to have more precise information and induce correlation between neighbors

interpolate interpolate population density
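
A minimal sketch of spatial interpolation from sparse points onto a regular grid, on synthetic data (the density values are illustrative):

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(7)
points = rng.uniform(0.0, 1.0, (80, 2))   # sparse measurement locations
density = points[:, 0] + points[:, 1]     # hypothetical density values

# interpolate onto a regular 20x20 grid
gx, gy = np.mgrid[0:1:20j, 0:1:20j]
dense = griddata(points, density, (gx, gy), method="linear")
```

Linear interpolation keeps every grid value a convex combination of its neighbors, which is how the neighbor correlation is induced.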

transform distribution

If we want to know how dense an area is with a particular geo feature

spot building spot building distance

We can reduce the feature density by fitting the radial histogram and returning the convexity of the fitted parabola

degeneracy spatial degeneracy, parabola convexity


If we apply boosting the distribution changes, and we can therefore train another model to predict the residuals

stat_prob residual distribution
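
A minimal sketch of one boosting step on synthetic data: a second model is trained on the residuals of the first, and the combined prediction improves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, (300, 1))
y = np.sin(3 * X[:, 0])

# first weak model
first = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
residuals = y - first.predict(X)

# second model trained on the residual distribution
second = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
combined = first.predict(X) + second.predict(X)

mse_first = np.mean((y - first.predict(X)) ** 2)
mse_combined = np.mean((y - combined) ** 2)
print(mse_first, mse_combined)
```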