• my backprop SGD from scratch 2022-Aug
    • 14:13 ok, reviewing from last time,
      • Yea, so I had switched from relu to sigmoid on commit b88ef76daf, but the log loss is still going up during training. That switch did for sure get rid of the earlier bug, where it did not make sense to map the relu output into a sigmoid: since a relu only produces non-negative numbers, the sigmoid could only ever produce values of 0.5 or above anyway (quick check below).
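
        A quick standalone check of that point (plain numpy here, not the project's own activation code):

          import numpy as np

          def sigmoid(z):
              return 1 / (1 + np.exp(-z))

          relu_out = np.maximum(0, np.linspace(-5, 5, 101))  # relu output is never negative
          print(sigmoid(relu_out).min())                     # 0.5, so predictions can never go below 0.5
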
      • So at this point, one thought I have for sure is whether this network is just one layer more complicated than a problem set this simple needs. The thought arose after seeing the weight output from the last training run,



      • But in any case, I think for now I am curious if I can find more bugs.
    • So we are underfitting here: the loss is increasing steadily, and I see the layer 1 and layer 2 weights are increasing steadily as well, which makes me think the two are related.
      • 16:02 let me try to observe the updates,


          import network as n
          import dataset
          import plot
          import runner
          import ipdb
          import matplotlib.pyplot as plt
          import pylab
          from collections import Counter
          from utils import utc_now, utc_ts
          
          data = dataset.build_dataset_inside_outside_circle(0.5)
          parameters = {"learning_rate": 0.01,
                        "steps": 50,
                        "log_loss_every_k_steps": 10
                       }
          
          runner.train_and_analysis(data, parameters)
          
        
        
  • 17:26 ah, well, spotted one silly bug in tracking the metrics: I had the train and validation losses I was logging flipped,


      if step % log_loss_every_k_steps == 0:
          _, total_loss = loss(model, data.X_validation, data.Y_validation)
          metrics["train"]["loss_vec"].append(total_loss)
      
          _, total_loss = loss(model, data.X_train, data.Y_train)
          metrics["validation"]["loss_vec"].append(total_loss)
    
    

    Fixed now so it is:

      if step % log_loss_every_k_steps == 0:
          _, total_loss = loss(model, data.X_validation, data.Y_validation)
          metrics["validation"]["loss_vec"].append(total_loss)
      
          _, total_loss = loss(model, data.X_train, data.Y_train)
          metrics["train"]["loss_vec"].append(total_loss)
      
    
    

    A bug indeed, but one that would not affect the training itself.
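
    As a side thought, one way to make that kind of swap harder to reintroduce would be to key the logging on the split name directly, something like this sketch (reusing the same names as the snippet above):

      if step % log_loss_every_k_steps == 0:
          splits = {
              "train": (data.X_train, data.Y_train),
              "validation": (data.X_validation, data.Y_validation),
          }
          for split_name, (X, Y) in splits.items():
              # the split name selects both the data and the metrics bucket,
              # so the two can no longer be flipped independently
              _, total_loss = loss(model, X, Y)
              metrics[split_name]["loss_vec"].append(total_loss)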

  • Ok, I think a good thing to do next is to continue my low level debugging: as I calculate g, if gradient descent is working properly, then I should be able to write an assert that the loss, at least for the single example, decreases after applying the g update; otherwise something is wrong!
  • 19:27 ok to check this I then have to calculate the loss on the micro-batch I'm using here,
    • ok, first here is how I would reshape a single example to obtain its loss,


        i = 0 
        x, y = data.X_train[i], data.Y_train[i]
        x.shape, y.shape
        
        Y_actual, total_loss = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))
        print("(x, y)", (x, y))
        print("Y_actual", Y_actual)
        print("loss", total_loss)
      
      
        (x, y) (array([ -7.55637702, -12.67353685]), 1)                                                                      
        Y_actual [0.93243955]
        loss 0.06995095896007311
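
      Building on that, the check I was describing is roughly the following. This is a sketch only: n.loss matches its use above, but n.gradient and n.apply_update are placeholder names standing in for whatever the network module actually calls its backprop and update steps.

        x1 = x.reshape((1, -1))
        y1 = y.reshape((1, 1))

        _, loss_before = n.loss(model, x1, y1)

        g = n.gradient(model, x1, y1)  # placeholder name for the backprop call
        model_updated = n.apply_update(model, g, parameters["learning_rate"])  # placeholder name for the SGD step

        _, loss_after = n.loss(model_updated, x1, y1)

        # with a small enough learning rate, the loss on this same example should not go up
        assert loss_after <= loss_before, (loss_before, loss_after)
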
      
      
    • And side note: I realized I'm technically not plotting the training loss, since the training set has 9,000 rows and I'm only really using 500 or so of them so far. So I will adjust the training loss calculation to cover specifically the portion I use (roughly the sketch below).
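
      Roughly what I mean, as a sketch, assuming one training row is consumed per step (which is how the steps / ~500-rows numbers seem to line up):

        rows_used = parameters["steps"]  # assumption: one training row consumed per step
        _, train_loss_used = n.loss(model, data.X_train[:rows_used], data.Y_train[:rows_used])
        print("train loss over just the rows actually used:", train_loss_used)
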
    • 20:27 ok cool, going to try out this new code, where I now also log the before and after loss for each micro-batch:
        import network as n
        import dataset
        import plot
        import runner
        import ipdb
        import matplotlib.pyplot as plt
        import pylab
        from collections import Counter
        from utils import utc_now, utc_ts
        
        data = dataset.build_dataset_inside_outside_circle(0.5)
        parameters = {"learning_rate": 0.01,
                      "steps": 500,
                      "log_loss_every_k_steps": 10
                     }
        
        model, artifacts, metrics = runner.train_and_analysis(data, parameters)
        outer: 100%|█████████████████████████████████████████████████████████████████████| 500/500 [00:12<00:00, 40.35it/s]
        saving to 2022-10-03T003158.png
        2022-10-03T003158.png
        2022-10-03T003159-weights.png
        2022-10-03T003200-hist.png
        saving to 2022-10-03T003201-scatter.png
        2022-10-03T003201-scatter.png
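
      For reference, the shape of the new per-micro-batch logging is roughly this (a sketch; the backprop and update call in the middle is whatever the training loop already does):

        # inside the training loop, for the current micro-batch example (x, y):
        y_hat_before, loss_before = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))

        # ... existing backprop + SGD update on (x, y) happens here ...

        y_hat_after, loss_after = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))

        metrics["micro_batch_updates"].append({
            "loss_before": loss_before, "y_actual_before": y_hat_before,
            "x": x, "y": y,
            "loss_after": loss_after, "y_actual_after": y_hat_after,
        })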
      
      

      And let me look at those micro batch updates then

        In [8]: metrics["micro_batch_updates"][:5]
        Out[8]: 
        [{'loss_before': 0.43903926069642474,
          'y_actual_before': array([0.64465547]),
          'x': array([-9.44442228,  1.4129736 ]),
          'y': 1,
          'loss_after': 0.43757904199626413,
          'y_actual_after': array([0.6455975])},
         {'loss_before': 1.0263273283159982,
          'y_actual_before': array([0.64167946]),
          'x': array([-3.4136343 , 17.13301918]),
          'y': 0,
          'loss_after': 1.0309406841349795,
          'y_actual_after': array([0.64332871])},
         {'loss_before': 0.4300753021013386,
          'y_actual_before': array([0.65046011]),
          'x': array([-2.26675345, -5.20582749]),
          'y': 1,
          'loss_after': 0.4285424015973017,
          'y_actual_after': array([0.65145797])},
         {'loss_before': 1.0544704530873739,
          'y_actual_before': array([0.65162314]),
          'x': array([ 14.74873303, -16.34664216]),
          'y': 0,
          'loss_after': 1.0598453040833464,
          'y_actual_after': array([0.65349059])},
         {'loss_before': 0.42370781274874675,
          'y_actual_before': array([0.65461512]),
          'x': array([  1.71615885, -11.0142264 ]),
          'y': 1,
          'loss_after': 0.42217509911520096,
          'y_actual_after': array([0.65561923])}]
        
        import matplotlib.pyplot as plt
        from utils import utc_now, utc_ts
        import pylab
        
        deltas = [x["loss_after"] - x["loss_before"] for x in metrics["micro_batch_updates"]]
        with plt.style.context("fivethirtyeight"):
            plt.hist(deltas, bins=50)
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
        
        # saving to 2022-10-03T005623-micro-batch-loss-deltas.png
      
      


    • Wow, fascinating, so the loss is getting reduced more often than not, but only slightly more often, haha,
        In [17]: from collections import Counter
            ...: Counter(["loss_reduction" if x < 0 else "loss_increase" for x in [y for y in deltas if y != 0]])
        Out[17]: Counter({'loss_reduction': 260, 'loss_increase': 240})
      
      

      And

        with plt.style.context("fivethirtyeight"):
            plt.plot(deltas)
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-over-steps.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
      
      

      But wow, this next plot is fascinating!

        with plt.style.context("fivethirtyeight"):
            fig = plt.figure(figsize =(20, 9))
        
            plt.plot(deltas, linewidth=0.7)
            plt.title("Microbatch loss_after - loss_before")
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-over-steps.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
      
      


    • So according to the above, yes, the micro-batch loss delta is ping-ponging back and forth and basically getting worse across the different micro-batch inputs. Wow, so glad I looked at this chronological kind of plot!
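
      One follow-up that would quantify "basically getting worse": the cumulative sum of those deltas is the total loss change attributable to the micro-batch updates themselves, so plotting it should slope upward if the updates are net-harmful. This reuses the deltas list and plotting pattern from above; the output filename is just a new made-up one.

        import numpy as np

        cumulative = np.cumsum(deltas)
        print("net loss change from all micro-batch updates:", cumulative[-1])

        with plt.style.context("fivethirtyeight"):
            fig = plt.figure(figsize=(20, 9))
            plt.plot(cumulative, linewidth=0.7)
            plt.title("Cumulative microbatch loss_after - loss_before")
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-cumulative.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()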