Summary
This project is a reboot of my earlier project of predicting bicycle ride share riders destinations. https://bike-hop-predict.s3.amazonaws.com/index.html
This time around I used XGBoost, newer features, and hyperparameter tuning, and I have a Demo Site as well! I very much wanted to see whether XGBoost supports online learning like I used in an earlier TensorFlow project, but as I wrote here, I could not pick up where I left off, at least the way I tried it. And here is a mini post looking at hyperparameter tuning results. And here is a visual look at some of the new features I explored.
Again, the data looks like this
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
"171","10/1/2015 00:00:02","10/1/2015 00:02:54","388","W 26 St & 10 Ave","40.749717753","-74.002950346","494","W 26 St & 8 Ave","40.74734825","-73.99723551","24302","Subscriber","1973","1"
"593","10/1/2015 00:00:02","10/1/2015 00:09:55","518","E 39 St & 2 Ave","40.74780373","-73.9734419","438","St Marks Pl & 1 Ave","40.72779126","-73.98564945","19904","Subscriber","1990","1"
"233","10/1/2015 00:00:11","10/1/2015 00:04:05","447","8 Ave & W 52 St","40.76370739","-73.9851615","447","8 Ave & W 52 St","40.76370739","-73.9851615","17797","Subscriber","1984","1"
"250","10/1/2015 00:00:15","10/1/2015 00:04:25","336","Sullivan St & Washington Sq","40.73047747","-73.99906065","223","W 13 St & 7 Ave","40.73781509","-73.99994661","23966","Subscriber","1984","1"
"528","10/1/2015 00:00:17","10/1/2015 00:09:05","3107","Bedford Ave & Nassau Ave","40.72311651","-73.95212324","539","Metropolitan Ave & Bedford Ave","40.71534825","-73.96024116","16246","Customer","","0"
"440","10/1/2015 00:00:17","10/1/2015 00:07:37","3107","Bedford Ave & Nassau Ave","40.72311651","-73.95212324","539","Metropolitan Ave & Bedford Ave","40.71534825","-73.96024116","23698","Customer","","0"
TOC
- Prior probability baseline
- Xgboost detour
- Multi class classification notes
- How does this time compare with the previous attempt
- Model highlights
- Gluing everything together
- Looking at hyperparameter tuning results
- Follow on
Prior probability baseline
Here, I first wanted to get a metric baseline using a simple model which only uses the highest prior probability destination as the prediction for a source bike share station. Anything even slightly more sophisticated should perform better. I also used this opportunity to apply multi-class log loss as an evaluation metric for this problem, which I had not tried last time. For an output probability vector over the 54 possible destination stations, log loss can assess the prediction probabilities more granularly against a one-hot vector of the correct station, [0, 1, 0, 0, 0, ...], compared to plain accuracy.
For example
from sklearn.metrics import log_loss
import numpy as np

# and some noisy predictions
noisy_pred = np.array([[.05, .05, .9],
                       [.95, 0.05, 0],
                       [.9, 0.1, 0],
                       [0.05, .05, .9],
                       [0, 1, 0]])
log_loss([3, 1, 1, 3, 2],
         noisy_pred)
The output here is 0.07347496827220674, which is just slightly worse than a perfect 0, showing that log loss can be handy for comparing models.
The detail is in the notes, but basically the cross validation log loss using this method ended up being
array([29.03426394, 25.61716199, 29.19083979, 28.312853 , 22.04601817])
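For reference, here is a minimal sketch of what that kind of prior-probability baseline could look like. The neighborhood column names and the hard one-hot prediction vectors are assumptions for illustration, not the exact code I used.

```python
import numpy as np
import pandas as pd

NUM_CLASSES = 54  # number of destination classes

def fit_prior_baseline(df):
    # For each start location, remember its single most frequent destination.
    # 'start_neighborhood' / 'end_neighborhood' are hypothetical encoded columns.
    return (df.groupby('start_neighborhood')['end_neighborhood']
              .agg(lambda s: s.value_counts().idxmax())
              .to_dict())

def predict_prior_baseline(top_destination, starts):
    # Hard one-hot "probability" vectors: all mass on the most frequent destination.
    probs = np.zeros((len(starts), NUM_CLASSES))
    for i, start in enumerate(starts):
        probs[i, top_destination[start]] = 1.0
    return probs
```

A hard, confident prediction like this gets punished heavily by log loss whenever it is wrong, which is consistent with the large baseline numbers above.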
Dockerization
Next, for repeatability and portability, here I re-adapted some earlier Dockerization I had set up to wrap xgboost, along with a jupyter notebook for experimentation. This was crucial, because if you want to jump between quick experiments on your laptop and a notebook in the cloud, you don’t want to deal with strange differences in library dependencies between macOS and Linux.
Xgboost detour
To start, I wanted to better understand XGBoost’s abilities around training a model, putting it down, saving it to disk, loading it again, and continuing to train on new data. I had used this capability in TensorFlow land earlier and I read it might be possible with XGBoost, but even after trial and error with both the main XGBoost API and its scikit-learn API, I could not get this to work properly. My notes on this are here in an earlier post.
One cool thing I did learn, however, was that when repeating a model train-and-evaluation experiment with both the functional API and the scikit-learn API, the functional API took advantage of multithreading and produced a particular result in 4min 18s vs 49min 6s, with both models using the same seed=42 and ending up with the same accuracy and log loss on some held-out data.
As I mentioned here, I experienced some early problems running out of memory and crashing, for instance when computing log loss on 843416 rows. That is why I was seeking out approaches to online learning. But because of the limitations, my workaround ended up being at least carefully deleting objects in memory with del to free up space between preprocessing, training and validation. I also played around with initializing an xgb.DMatrix using the xgb.DMatrix('/my/blah/path#dtrain.cache') syntax, where the # specifies a cache file to allow file-backed access that reduces the in-memory burden. This also requires dumping your preprocessed training data to a file first. (And doing that is good anyway, because it allows you to free up that precious memory.)
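As a rough sketch (paths are placeholders, and `X_transformed` / `y_enc` stand for the preprocessed arrays from the notebook), that external-memory pattern looked something like this:

```python
import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

# Dump the preprocessed training data to disk first (libsvm text format),
# which also lets you free the big in-memory arrays afterwards.
dump_svmlight_file(X_transformed, y_enc, '/opt/data/dtrain.libsvm')
del X_transformed, y_enc

# The '#...cache' suffix asks xgboost to build an external-memory cache file
# next to the data, so training streams from disk instead of holding everything in RAM.
dtrain = xgb.DMatrix('/opt/data/dtrain.libsvm#dtrain.cache')
```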
Compared to the initial baseline log loss from earlier of around 29, here I noted a result of 3.9934347 with the initial xgboost approach.
On 2020-06-14, I tried using the xgboost caching with the scikit-learn API approach. In the meantime I also ran into a fun issue where an xgboost model was trained on data missing a particular output class, with only 53 classes in fact, and would produce predicted probability vectors of length 53 instead of 54. So I ended up having to better shuffle the data, to make sure that when using less data (for cross validation, for instance) all of the output classes are accounted for, since there is no more direct way of telling XGBoost what the output classes should be.
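The kind of guard the shuffling was meant to provide looks roughly like this sketch (variable names assumed to match the snippets later in this post):

```python
import numpy as np

rng = np.random.RandomState(42)

# Shuffle so that any subsample (e.g. one cross-validation fold) still covers all 54 classes.
indices = rng.permutation(len(y_enc))
X_shuffled, y_shuffled = X_transformed[indices], y_enc[indices]

subset = slice(0, 100_000)  # a hypothetical "use less data" slice
assert len(np.unique(y_shuffled[subset])) == 54, (
    'a destination class is missing; predict_proba would return fewer than 54 columns')
```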
Another fun TensorFlow comparison: I got XGBoostError: need to call fit or load_model beforehand when trying to call predict on a bare model that had not undergone training. With TensorFlow, I experienced in a previous project that this is absolutely fine, because you simply have a fully formed neural network with some randomly (or otherwise) initialized weights. But with xgboost, or at least the particular implementation I was using, this is not possible, because there is no notion of a base model.
Here, I tried cutting up the training data like
clf = xgb.XGBClassifier()
workdir = fu.make_work_dir(); print(workdir)
fu.log(workdir, 'Starting')
prev_model = None
loss_vec = []; acc_vec = []
for i, part in enumerate(tqdm(parts)):
    clf.fit(X_transformed[part], y_enc[part], xgb_model=prev_model)
    fu.log(workdir, f'[{i}] Done fit')

    prev_model = f'{workdir}/model.xg'
    clf.save_model(prev_model)

    y_prob_vec = clf.predict_proba(X_test_transformed)
    fu.log(workdir, f'[{i}] Done predict_proba')

    loss = fu.big_logloss(y_test_enc, y_prob_vec, labels)
    fu.log(workdir, f'[{i}] Done big_logloss, loss={loss}.')
    loss_vec.append(loss)

    acc = accuracy_score(y_test_enc, np.argmax(y_prob_vec, axis=1))
    acc_vec.append(acc)
    fu.log(workdir, f'[{i}] Done accuracy, acc={acc}.')
to see whether the scikit-learn API would allow saving a previously trained model and continuing from it, via the fit(X, y, xgb_model=prev_model) syntax. But the output performance data was basically random, indicating to me that the fit was starting from scratch each time.
Here, below, is a plot of accuracy after multiple epochs, just to visually show the lack of any progression. (This plot is from a similar experiment in my 2020-06-16 notebook ).
So basically I gave up on this approach for using xgboost.
Also in that notebook, I found I was oddly getting 0 logloss during some experimentation because, like in the toy example below, I had been passing labels to the log_loss func that did not match the actual y_true data (the first argument).
a, b = np.array([0, 0, 0]), np.array([[0, 0., 0],
                                      [0., 0, 0],
                                      [0., 0, 0]])
print(log_loss(a, b, labels=['a', 'b', 'c']))
# => 0.0
2020-06-19
Here, because a lot of the test set prediction for model evaluation takes time, I ended up creating a mini parallelization func. I verified that it produced roughly the same validation results and that the time to execute was less.
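The parallelization was roughly in the spirit of the sketch below, splitting the held-out set into chunks and scoring them in worker processes. The helper names here are made up for illustration, not the actual `fu` code.

```python
from multiprocessing import Pool

import numpy as np

def _predict_chunk(args):
    model, X_chunk = args
    return model.predict_proba(X_chunk)

def parallel_predict_proba(model, X, n_workers=4):
    # Score the test set in chunks across processes, then stitch the pieces back together.
    # (The model gets pickled once per chunk, which is wasteful but simple.)
    chunks = np.array_split(X, n_workers * 4)
    with Pool(n_workers) as pool:
        parts = pool.map(_predict_chunk, [(model, chunk) for chunk in chunks])
    return np.concatenate(parts)
```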
I also wrote about how I had needed to use less data to avoid crashing, using the pandas sample()
func like,
tripsdf = pd.read_csv(f'{datadir}/2013-07 - Citi Bike trip data.csv'
                      ).sample(frac=0.017, random_state=42)
but that it would be better to build a more balanced dataset instead of just random sampling.
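A rough idea of what a more balanced sample could look like, capping each destination class at some count instead of sampling uniformly at random (the `end_neighborhood` column and the cap are assumptions here):

```python
MAX_PER_CLASS = 5000  # hypothetical per-class cap

# Cap each destination class so the big classes cannot dominate the training set.
balanced_df = (tripsdf.groupby('end_neighborhood', group_keys=False)
                      .apply(lambda g: g.sample(min(len(g), MAX_PER_CLASS),
                                                random_state=42)))
```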
A rapid-fire list of some additional experiments

- Here, I had another training attempt using the so-called “cache data” with the functional API. (But finding, especially here, that `max_depth` was not changing, so likely no learning was happening.)
- Here I added new features for a ‘v2’ dataset, including `weekday` and `time_of_day`. Added these in a new module, `fresh/preproc/v2.py`.
- Here, more memory struggles (especially since I added more data).
- Here, describing how, after lots of crashing, I started to use numpy append mode, here, to allow for doing preprocessing in chunks. And starting to look at the target class distribution.
- Here I’m discovering that it is quite possible that this caching is only allowed with the “libsvm” format!
- And here, I see it is kind of weird that with cache vs. without, I get different `feature_names`? More kernel dying!
- Here, bigger box?
- Here, class distribution for size reduction and dataset rebalancing.
- Here I took the balancing/shrinking concept from the “2020-06-29.ipynb” notebook and tried to use less data, to see if we can avoid the kernel crash that happened in “2020-06-28-take2.ipynb”.
- Here I wanted to do a quick recalc of yesterday’s model using the sklearn API again.
- Here, more rebalancing.
- Here.
- Here, end result: compared with “2020-07-03-aws.md”, I am not really seeing much of a difference. The balanced test accuracy perhaps looks ever so slightly better, but probably not significantly so.
- Here, change up `'subsample'` and `'max_depth'`, measuring logloss, accuracy and balanced_accuracy. There are some noticeable changes in logloss, but overall the changes are probably not significant.
- Here.
- Here.
- Here.
- Here.
Multi class classification notes
Understanding tuning results
hyper parameter tuning and train/test acc
Previously vs This time
- Last time around, I segmented the starting data into 24 one-hour segments. This time, I segmented time into only 5 bins to make the model slightly more generalizable.
# time_of_day
0: 6-9,
1: 10-13,
2: 14-16,
3: 17-21,
4: 22-0, 0-5
Actually, after picking these bins arbitrarily, I ended up also looking at the time of day histograms here, and the peaks look close to the mental model I had in mind. It might be interesting to try some other bins at some point later.
One other new feature this time is the binary `weekday` feature, specifying weekday vs. weekend. The starting neighborhood, one-hot encoded, was kept as an input.
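A minimal sketch of how these two features could be derived from the raw `starttime` column; this is illustrative only, the real code lives in `fresh/preproc/v2.py`:

```python
import pandas as pd

def time_of_day(hour):
    # The five arbitrary bins described above.
    if 6 <= hour <= 9:
        return 0
    if 10 <= hour <= 13:
        return 1
    if 14 <= hour <= 16:
        return 2
    if 17 <= hour <= 21:
        return 3
    return 4  # 22:00 through 05:59

starttimes = pd.to_datetime(tripsdf['starttime'])
tripsdf['time_of_day'] = starttimes.dt.hour.map(time_of_day)
tripsdf['weekday'] = (starttimes.dt.dayofweek < 5).astype(int)  # 1 = weekday, 0 = weekend
```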
Also, last time around the main model was a Random Forest classifier; this time it is XGBoost.
Model Highlights
The top model has these stats…
(pandars3) $ docker run -p 8889:8889 -p 8080:8080 -i -t -v $(pwd):/opt/program \
-v ${MY_LOCAL_DATA_DIRECTORY}:/opt/data \
-v ~/Downloads:/opt/downloads \
-v $(pwd)/artifacts/2020-08-19T144654Z:/opt/ml \
citibike-learn:latest \
import fresh.predict_utils as fpu
bundle = fpu.load_bundle_in_docker()
In [7]: bundle['model_bundle']['bundle']['validation_metrics']
Out[7]:
{'accuracy': 0.12171455130090014,
'balanced_accuracy': 0.10451301995291779,
'confusion': array([[415, 64, 4, ..., 0, 103, 69],
[ 56, 541, 4, ..., 0, 130, 27],
[ 23, 10, 136, ..., 0, 16, 130],
...,
[ 2, 0, 2, ..., 1, 3, 36],
[151, 222, 3, ..., 0, 260, 35],
[ 84, 25, 46, ..., 0, 29, 861]]),
'logloss': 3.4335361255637977,
'test': '/home/ec2-user/SageMaker/learn-citibike/artifacts/2020-07-08T143732Z/test.libsvm',
'karea': 0.760827309330065}
- More on the “k-area” metric is here
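Roughly, the idea behind k-area is to look at where the correct class lands in the ranked probability vector and then take the area under the cumulative “correct within the top k” curve. A sketch of that calculation is below; my actual `fm.kth_area` may differ in details.

```python
import numpy as np

def kth_area(y_true, y_prob, num_classes=54):
    # Rank of the true class within each prediction (0 = the model's top choice).
    order = np.argsort(-y_prob, axis=1)
    ranks = np.array([np.where(order[i] == int(y_true[i]))[0][0]
                      for i in range(len(y_true))])

    # Fraction of examples whose true class is within the top k, for k = 1..num_classes.
    correct_kth = np.array([(ranks < k).mean() for k in range(1, num_classes + 1)])

    # Normalized area under that curve: 1.0 is perfect, random guessing sits near 0.5.
    karea = correct_kth.mean()
    return correct_kth, karea
```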
Top Model’s Top Fscore features
Extracting from this notebook ,
And it can be interesting to look at a random tree from xgboost too sometimes, again extracting from the above-mentioned notebook.
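For instance, something along these lines with xgboost’s graphviz plotting, where `model` is the trained booster from the bundle and the tree index is arbitrary (this requires the graphviz package):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot one arbitrary tree from the boosted ensemble.
fig, ax = plt.subplots(figsize=(30, 15))
xgb.plot_tree(model, num_trees=7, ax=ax)
plt.show()

# Or build a graphviz object directly and render it to a file.
graph = xgb.to_graphviz(model, num_trees=7)
graph.render('tree_7', format='png')
```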
Gluing everything together
In this notebook, I faced the challenges of taking the model from a bundle to a demo site, and there were quite a few. My concept was to use the Google Static Map API to display the top neighborhood predictions. Hitting this API properly did take a little bit of time, but it was not that bad. And later on, I updated the whole AWS Lambda approach so the lambda function calls the API with the result from the dockerized SageMaker-served model.
Admittedly, the most time-consuming part was figuring out the API Gateway Cognito “Unauthenticated Authentication”. AWS has this Cognito service which manages user/password-based authentication for you, but it also lets you use anonymous authentication. There must be a lot of degrees of freedom in how this is used, though, because I could not find good documentation on how to set it up properly for my use case at all.
I had dealt with API Gateway and CORS in the past, and I recalled a bit of nuance: for example, you may have set up CORS properly for 200 status codes, but if your program crashes with a 500, then your browser will scream about a CORS error, because the error response is not returning the expected Access-Control-Allow-Origin header. In the past this had taken me a while to figure out, but now I luckily had that knowledge in my back pocket. In any case, it is worth it for the serverless approach.
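The practical consequence is that the Lambda behind API Gateway should attach the CORS header on every response path, including errors. A hedged sketch of that shape (not my exact handler; `do_prediction` is a hypothetical call into the model-serving code):

```python
import json

CORS_HEADERS = {'Access-Control-Allow-Origin': '*'}  # or the specific site origin

def lambda_handler(event, context):
    try:
        result = do_prediction(event)  # hypothetical call into the model-serving code
        return {'statusCode': 200, 'headers': CORS_HEADERS, 'body': json.dumps(result)}
    except Exception as exc:
        # Without the header here, the browser reports a CORS error instead of the real 500.
        return {'statusCode': 500, 'headers': CORS_HEADERS,
                'body': json.dumps({'error': str(exc)})}
```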
Automation made the process very convenient
https://github.com/namoopsoo/learn-citibike/blob/master/notes/2020-06-07-local-docker-notes.md
I also described my build process in the earlier-mentioned glue notes. With so many tweaks to the python side, the model, and the javascript side, being able to build and deploy with quick make-style commands made everything faster. I document some of these here too.
Quick Pearson’s chi-squared independence test
https://github.com/namoopsoo/learn-citibike/blob/master/notes/2020-07-05.md
Looking at hyperparameter tuning results
(EDIT: After writing the section below, I realized I had already described some of these results earlier, here on 2020-07-24, haha. Doing the work twice, forgetting what I had done.)
I spent a bit of time on hyperparameter tuning, looking at the results, fixing some parameters to focus on two others at a time.
So per here, `num_round`, as expected, improves logloss,
keep_fixed = {
    'max_depth': 3,
    'learning_rate': 0.01,
    'colsample_bylevel': 0.1,
    'colsample_bynode': 1,
    'colsample_bytree': 0.1,
    'subsample': 0.1,
    'num_round': 10,
}
col1, col2, metric_col = 'max_depth', 'num_round', 'logloss'
fp.compare_tuning(df, feature_col_1=col1,
                  feature_col_2=col2,
                  metric_col=metric_col,
                  keep_fixed=fvu.without(
                      keep_fixed, keys=[col1, col2]))
And maybe this is just good as a sanity check, but more rounds take more time (per here).
And from here, it was interesting to see that walltime is mostly stable with respect to learning rate, except sometimes…
And per here, at least with these fixed parameters, the 0.1 learning rate is better than the 0.01 learning rate.
And per here, `subsample` row sampling just does not appear to be influencing accuracy.
And per here, the random column sampling may have just removed the good columns.
Train and test accuracy comparison
- Here, I took all of my 1000+ models from earlier (which were on S3, so I had to copy them locally for convenience) and calculated accuracy, logloss and karea metrics for the training data, in order to get learning curves to understand underfitting/overfitting.
- Just showing here an example run for one model…
# As per https://github.com/namoopsoo/learn-citibike/blob/2020-revisit/notes/2020-07-10-aws.md
# the data dir was artifacts/2020-07-08T143732Z ... going to re-create that locally too
#
datadir = '/opt/program/artifacts/2020-07-08T143732Z'
artifactsdir = '/opt/program/artifacts/2020-07-10T135910Z'
train_results = []
train_loc = f'{datadir}/train.libsvm'
dtrain = xgb.DMatrix(f'{train_loc}?format=libsvm')
actuals = dtrain.get_label()
print('evaluate using ', train_loc)
train_data = load_svmlight_file(train_loc)
X_train = train_data[0].toarray()
y_train = train_data[1]
%%time
########
# Try one
i = 0
bundle = joblib.load(f'{artifactsdir}/{i}_bundle_with_metrics.joblib')
model = bundle['xgb_model']
y_prob_vec = model.predict(dtrain)
predictions = np.argmax(y_prob_vec, axis=1)
logloss = fu.big_logloss(actuals, y_prob=y_prob_vec,
                         labels=list(range(54)))
acc = accuracy_score(actuals, predictions)
balanced_acc = balanced_accuracy_score(actuals, predictions)
correct_kth, karea = fm.kth_area(y_train, y_prob_vec,
                                 num_classes=54)
CPU times: user 31.3 s, sys: 110 ms, total: 31.4 s
Wall time: 21.4 s
acc, balanced_acc, karea
(0.05276320740101365,
0.03727538888502701,
0.6435250908504123)
The whole thing took about 10 hours, as measured by the final line from tqdm:
100%|█████████▉| 1052/1054 [10:00:57<01:08, 34.27s/it]
Putting that together, training and test accuracy are pretty consistently close, with training accuracy being slightly better, as expected. So there is no evidence overall of overfitting, but perhaps some evidence of underfitting.
First I looked at the parameters fixed to whatever happened to be in the first model built (so, pretty arbitrary), comparing over the number of rounds. The results, not unexpectedly, did not show much learning happening.
Then I took the model with the best test accuracy results,
best_params = dict(alldf.sort_values(by='acc').iloc[-1])
best_params
{'train_acc': 0.12693459297270465,
'train_balanced_acc': 0.11012147901980039,
'i': 755,
'train_logloss': 3.4301962566050057,
'train_karea': 0.76345208497788,
'max_depth': 4,
'learning_rate': 0.1,
'objective': 'multi:softprob',
'num_class': 54,
'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1.0,
'colsample_bynode': 1,
'colsample_bytree': 1.0,
'gamma': 0,
'max_delta_step': 0,
'min_child_weight': 1,
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': 42,
'subsample': 0.4,
'verbosity': 0,
'acc': 0.12304248437307332,
'balanced_acc': 0.10551953202851949,
'logloss': 3.4480742986458592,
'walltime': 1918.593945,
'karea': 0.75845582462009,
'num_round': 100}
And I plotted all the train/test metrics across rounds, and this figure definitely shows learning happening. Very exciting!
Also looked for the biggest gap between train/test accuracy
- And per the below, interestingly, it seems that even the biggest train/test gap is very small:
alldf['train_test_acc_delta'] = alldf.apply(lambda x: abs(x['acc'] - x['train_acc']), axis=1)
alldf.sort_values(by='train_test_acc_delta').iloc[-1]
train_acc 0.128123
train_balanced_acc 0.111239
i 1241
train_logloss 3.40954
train_karea 0.767823
max_depth 5
learning_rate 0.1
objective multi:softprob
num_class 54
base_score 0.5
booster gbtree
colsample_bylevel 1
colsample_bynode 1
colsample_bytree 1
gamma 0
max_delta_step 0
min_child_weight 1
random_state 0
reg_alpha 0
reg_lambda 1
scale_pos_weight 1
seed 42
subsample 0.4
verbosity 0
acc 0.12253
balanced_acc 0.104698
logloss 3.43584
walltime 2327.88
karea 0.760578
num_round 100
train_test_acc_delta 0.00559313
Name: 1242, dtype: object
Initial time of day look
https://github.com/namoopsoo/learn-citibike/blob/master/notes/2020-08-05-woe.md
https://github.com/namoopsoo/learn-citibike/blob/2020-oct/notes/2020-08-25-glue.md
Feature importances
From the many hyper parameter tuning jobs I had run, I used the xgboost feature importance functionality to dump the perceived feature importances for all of the models. And in the notes I plotted feature importances against accuracy for all of them.
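The dumping was roughly of this shape, using the booster’s `get_score`; the bundle-loading loop here is a simplified placeholder (`num_models` and the artifact naming are assumptions):

```python
import joblib

importances = []
for i in range(num_models):  # hypothetical loop over the tuning artifacts
    bundle = joblib.load(f'{artifactsdir}/{i}_bundle_with_metrics.joblib')
    model = bundle['xgb_model']
    # Fscore-style importance: how often each feature is used to split, keyed by feature name.
    importances.append(model.get_score(importance_type='weight'))
```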
For example, here are some of the more interesting plots,
The point here is that I had one-hot encoded all of the starting neighborhoods. I am hoping, of course, that if a particular starting location looks important, that means it is important in discriminating where you go next, i.e. it narrows down where you go. On the other hand, if your starting location is boring, then that should mean it is more like a hub, and there are too many destinations for the start alone to be a helpful feature.
In the above plots, there is a wide range of models and they are showing that for some reason high importance does not necessarily mean high accuracy. If anything, I want to make a mental note that maybe these kinds of plots can be indicators of something wrong and some kind of under-fitting in particular. Or weak fitting at least. And one of the other scenarios is that fitting is weak, because there is not enough entropy in the data available to yield helpful discrimination with a model. No matter how well XGBoost can extract information, if the raw material does not have any diamonds, then we will be stuck.
The other thought is that there is an overfitting danger around not just an imbalance in the target variable (aka the destination neighborhood) but an imbalance in the starting locations too. This is why it would be really interesting to also look at the entropy of the multiclass outputs for signs of clear uncertainty on specific examples. Putting a pin in this for the follow-on section.
The time of day features look like this, below, but again, this is not to say that these views represent the full story.
Thinking about this a bit more in retrospect, these particular representations are probably not very meaningful to look at, because if there are trends, they need to be looked at while “localizing” or “fixing” some of the parameters. These representations are all over the place, but a relationship may still be hidden inside.
I think one of the top follow ons has to be to find better time of day splits. I chose my time of day splits based on a model in my head, and so there is definitely some room for exploration here.
Follow On
Time of day more splitting exploration
Find some more interesting techniques to try out different segmentations of the time of day. (I’m thinking “adaptive binning” as described here.)
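One simple variant of that could be quantile-based (equal-frequency) bins on the start hour, for example with `pd.qcut`. This is a sketch of the idea, not something I have run here:

```python
import pandas as pd

hours = pd.to_datetime(tripsdf['starttime']).dt.hour

# Equal-frequency bins: each bin carries roughly the same number of rides, so the
# cut points adapt to the actual time-of-day distribution instead of being hand-picked.
tripsdf['time_of_day_q'], bin_edges = pd.qcut(
    hours, q=5, labels=False, retbins=True, duplicates='drop')
print(bin_edges)
```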
Better understanding of model uncertainty
- As discussed in the feature importances section, it would be really interesting to take the test dataset and, for the output probability vectors of all of the examples, calculate the multi-class entropy, to see if indeed high uncertainty is associated with worse correctness rank (`kth accuracy` and `karea` in the other terminology I have been using). See the sketch after this list.
- Of course this is really tricky from an Active Learning point of view, because I can see a scenario where adding more training examples around the cases which have higher uncertainty may improve the accuracy for the related test examples, but that feels like there is a risk of overfitting to the test set. In any case, if the live data is not reflective of the training/test data distributions (covariate shift), then refreshing the model is important.
Some lessons for the future
Approach to training and artifacts
Training and hyperparameter tuning take a long time. Dumping artifacts along the way, including models and results (for example as json), is helpful because it allows another notebook to actively monitor the results while they are running. It is also helpful because notebooks that run long experiments can sometimes crash, so it is nice to have saved intermediary results.
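A sketch of that habit: appending each finished result to a json artifact so another notebook can poll the file while the sweep is still running. The path and structure are illustrative.

```python
import json

RESULTS_PATH = '/opt/program/artifacts/tuning_results.json'  # placeholder location

def append_result(result, path=RESULTS_PATH):
    # Read-modify-write keeps the file valid json after every model, so a monitoring
    # notebook can load it at any point, and a crash only loses the in-flight model.
    try:
        with open(path) as f:
            results = json.load(f)
    except FileNotFoundError:
        results = []
    results.append(result)
    with open(path, 'w') as f:
        json.dump(results, f)
```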
Notebooks
I like the concept of keeping a daily notebook, because keeping several experiments in one notebook risks running out of memory, and it is sometimes difficult to load large notebooks on github (even when converted to markdown) if there are a lot of images.
Write sooner rather than later
- Although it is tempting to just keep trying more and more experiments and keep pushing the frontier forward, I think a difficult lesson to learn is that putting together the results of the day or the week takes much more time when done weeks or months later. Summarizing and discussing your results as you go along is way more useful.
- But if you do wait, another idea is to just create a notebook table of contents, as I am doing below, as a way of having a quick chronological reference for the work that was done.
Notebooks TOC
- 2020-07-10, like “2020-07-09-aws”, another hyperparameter tuning round here: `max_depth`, `subsample`, `colsample_bytree`.
- 2020-07-11, here I plot a bunch of results (on my laptop) from the 2020-07-10 notebook running on aws.
- 2020-07-16-local.md, recalculating train metrics for the ~1250 or so models from the hyperparameter tuning session.
- 2020-07-26-feature-importances.md, looking at feature importances, reverse engineering my `proc_bundle` to get back my list of feature names, which I had not done originally. Initially trying `model.get_score()`, dumping from each model. This actually took 3.5 hours. I plotted features and accuracy in a few ways to try to gauge which features were more often associated with high-accuracy models, plotting the correlation of feature importance and accuracy. I think this was not a super useful method. Ultimately, the fscore approach was better.
- 2020-08-05-woe.md, EDA on the time_of_day feature, visual histogram comparisons. Not the most fruitful, however.
- 2020-08-17-bundle-glue.md
- 2020-08-18-glue.md, some reverse engineering to repurpose my preprocessor bundle for live traffic. And combining preprocessor and model to make a joblib bundle with everything in it. And drafting a `full_predict` method.
- 2020-08-22-static-map-api.md, getting set up with the Google Static Map API. Very nice.
- 2020-08-25-glue.md, Docker entry code, end-to-end live code. And building the code for the lambda that calls Docker. Unfortunately xgboost does not fit on the lambda. And oops, lambda cannot write to the file system. And working through the new API Gateway authentication methods here. I wrote some support code for quick lambda deployment because I ended up needing many iterations to get this right. Content-type weirdness. Javascript plus Cognito. This was not documented very well, so a lot of blundering here. Can’t believe I finally made all of this work. This was insane.
- 2020-10-20-karea-worst.md K Area worst case scenario.
- 2020-10-21-look-at-model-plot.md looking at Fscore as well as plotting individual trees with graphviz. Also some interesting issues with versions of xgboost in docker and lack of backward compatibility.
- 2020-10-21-uncertainty-xgboost.md this is mainly just a footnote about the idea around measuring uncertainty in xgboost. But this is likely not super reliable.
- 2020-10-22-features-v3.md Take a quick look at time of day distribution
- 2020-10-23-quick-new-v3-proc-bundle.md one more model iteration using new features.
- 2020-10-25.md evaluate the new v3. Although I have not yet done any tuning, so far this does not seem significantly better, with karea 0.761 versus the earlier best of 0.760.