• my backprop SGD from scratch 2022-Aug
    • 14:13 ok, reviewing from last time,
      • Yea, so I had switched from relu to sigmoid on commit b88ef76daf, but the log loss is still going up during training. That switch did for sure get rid of the earlier bug, where it did not make sense to map the relu output into a sigmoid: since a relu only produces non-negative numbers, the sigmoid could only ever produce values of 0.5 or above anyway (quick check below).
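
        A quick standalone check of that point (plain numpy here, not the project's own activation code):

          import numpy as np

          def sigmoid(z):
              return 1 / (1 + np.exp(-z))

          relu_out = np.maximum(0, np.linspace(-5, 5, 101))  # relu output is never negative
          print(sigmoid(relu_out).min())                     # 0.5, so predictions can never go below 0.5
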
      • So at this point, one thought I have for sure is whether this network is just one layer more complicated than a problem set this simple needs. The thought arose after seeing the weight output from the last training run,



      • But in any case, I think for now I am curious if I can find more bugs.
    • So we are underfitting here: the loss is increasing steadily, and I see the layer 1 and layer 2 weights are increasing steadily as well, which makes me think the two are related.
      • 16:02 let me try to observe the updates,


          import network as n
          import dataset
          import plot
          import runner
          import ipdb
          import matplotlib.pyplot as plt
          import pylab
          from collections import Counter
          from utils import utc_now, utc_ts
          
          data = dataset.build_dataset_inside_outside_circle(0.5)
          parameters = {"learning_rate": 0.01,
                        "steps": 50,
                        "log_loss_every_k_steps": 10
                       }
          
          runner.train_and_analysis(data, parameters)
          
        
        
  • 17:26 ah, well, spotted one silly bug in tracking the metrics: I had the train and validation losses I was logging flipped,


      if step % log_loss_every_k_steps == 0:
          _, total_loss = loss(model, data.X_validation, data.Y_validation)
          metrics["train"]["loss_vec"].append(total_loss)
      
          _, total_loss = loss(model, data.X_train, data.Y_train)
          metrics["validation"]["loss_vec"].append(total_loss)
    
    

    Fixed now so it is:

      if step % log_loss_every_k_steps == 0:
          _, total_loss = loss(model, data.X_validation, data.Y_validation)
          metrics["validation"]["loss_vec"].append(total_loss)
      
          _, total_loss = loss(model, data.X_train, data.Y_train)
          metrics["train"]["loss_vec"].append(total_loss)
      
    
    

    A bug indeed, but one that would not affect the training itself.
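
    As a side thought, one way to make that kind of swap harder to reintroduce would be to key the logging on the split name directly, something like this sketch (reusing the same names as the snippet above):

      if step % log_loss_every_k_steps == 0:
          splits = {
              "train": (data.X_train, data.Y_train),
              "validation": (data.X_validation, data.Y_validation),
          }
          for split_name, (X, Y) in splits.items():
              # the split name selects both the data and the metrics bucket,
              # so the two can no longer be flipped independently
              _, total_loss = loss(model, X, Y)
              metrics[split_name]["loss_vec"].append(total_loss)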

  • Ok, I think a good thing to do next is to continue my low level debugging: as I calculate g, if gradient descent is working properly, then I should be able to write an assert that the loss, at least for the single example, decreases after applying the g update; otherwise something is wrong!
  • 19:27 ok to check this I then have to calculate the loss on the micro-batch I'm using here,
    • ok, first here is how I would reshape a single example to obtain its loss,


        i = 0 
        x, y = data.X_train[i], data.Y_train[i]
        x.shape, y.shape
        
        Y_actual, total_loss = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))
        print("(x, y)", (x, y))
        print("Y_actual", Y_actual)
        print("loss", total_loss)
      
      
        (x, y) (array([ -7.55637702, -12.67353685]), 1)                                                                      
        Y_actual [0.93243955]
        loss 0.06995095896007311
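
      Building on that, the check I was describing is roughly the following. This is a sketch only: n.loss matches its use above, but n.gradient and n.apply_update are placeholder names standing in for whatever the network module actually calls its backprop and update steps.

        x1 = x.reshape((1, -1))
        y1 = y.reshape((1, 1))

        _, loss_before = n.loss(model, x1, y1)

        g = n.gradient(model, x1, y1)  # placeholder name for the backprop call
        model_updated = n.apply_update(model, g, parameters["learning_rate"])  # placeholder name for the SGD step

        _, loss_after = n.loss(model_updated, x1, y1)

        # with a small enough learning rate, the loss on this same example should not go up
        assert loss_after <= loss_before, (loss_before, loss_after)
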
      
      
    • And side note: I realized I'm technically not plotting the training loss, since the training set has 9,000 rows and I'm only really using 500 or so of them so far. So I will adjust the training loss calculation to cover specifically the portion I use (roughly the sketch below).
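
      Roughly what I mean, as a sketch, assuming one training row is consumed per step (which is how the steps / ~500-rows numbers seem to line up):

        rows_used = parameters["steps"]  # assumption: one training row consumed per step
        _, train_loss_used = n.loss(model, data.X_train[:rows_used], data.Y_train[:rows_used])
        print("train loss over just the rows actually used:", train_loss_used)
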
    • 20:27 ok cool, going to try out this new code, where I now also log the before and after loss for each micro-batch:
        import network as n
        import dataset
        import plot
        import runner
        import ipdb
        import matplotlib.pyplot as plt
        import pylab
        from collections import Counter
        from utils import utc_now, utc_ts
        
        data = dataset.build_dataset_inside_outside_circle(0.5)
        parameters = {"learning_rate": 0.01,
                      "steps": 500,
                      "log_loss_every_k_steps": 10
                     }
        
        model, artifacts, metrics = runner.train_and_analysis(data, parameters)
        outer: 100%|█████████████████████████████████████████████████████████████████████| 500/500 [00:12<00:00, 40.35it/s]
        saving to 2022-10-03T003158.png
        2022-10-03T003158.png
        2022-10-03T003159-weights.png
        2022-10-03T003200-hist.png
        saving to 2022-10-03T003201-scatter.png
        2022-10-03T003201-scatter.png
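
      For reference, the shape of the new per-micro-batch logging is roughly this (a sketch; the backprop and update call in the middle is whatever the training loop already does):

        # inside the training loop, for the current micro-batch example (x, y):
        y_hat_before, loss_before = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))

        # ... existing backprop + SGD update on (x, y) happens here ...

        y_hat_after, loss_after = n.loss(model, x.reshape((1, -1)), y.reshape((1, 1)))

        metrics["micro_batch_updates"].append({
            "loss_before": loss_before, "y_actual_before": y_hat_before,
            "x": x, "y": y,
            "loss_after": loss_after, "y_actual_after": y_hat_after,
        })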
      
      

      And let me look at those micro batch updates then

        In [8]: metrics["micro_batch_updates"][:5]
        Out[8]: 
        [{'loss_before': 0.43903926069642474,
          'y_actual_before': array([0.64465547]),
          'x': array([-9.44442228,  1.4129736 ]),
          'y': 1,
          'loss_after': 0.43757904199626413,
          'y_actual_after': array([0.6455975])},
         {'loss_before': 1.0263273283159982,
          'y_actual_before': array([0.64167946]),
          'x': array([-3.4136343 , 17.13301918]),
          'y': 0,
          'loss_after': 1.0309406841349795,
          'y_actual_after': array([0.64332871])},
         {'loss_before': 0.4300753021013386,
          'y_actual_before': array([0.65046011]),
          'x': array([-2.26675345, -5.20582749]),
          'y': 1,
          'loss_after': 0.4285424015973017,
          'y_actual_after': array([0.65145797])},
         {'loss_before': 1.0544704530873739,
          'y_actual_before': array([0.65162314]),
          'x': array([ 14.74873303, -16.34664216]),
          'y': 0,
          'loss_after': 1.0598453040833464,
          'y_actual_after': array([0.65349059])},
         {'loss_before': 0.42370781274874675,
          'y_actual_before': array([0.65461512]),
          'x': array([  1.71615885, -11.0142264 ]),
          'y': 1,
          'loss_after': 0.42217509911520096,
          'y_actual_after': array([0.65561923])}]
        
        import matplotlib.pyplot as plt
        from utils import utc_now, utc_ts
        import pylab
        
        deltas = [x["loss_after"] - x["loss_before"] for x in metrics["micro_batch_updates"]]
        with plt.style.context("fivethirtyeight"):
            plt.hist(deltas, bins=50)
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
        
        # saving to 2022-10-03T005623-micro-batch-loss-deltas.png
      
      


    • Wow, fascinating, so the loss is getting reduced more often than not, but only slightly more often, haha,
        In [17]: from collections import Counter
            ...: Counter(["loss_reduction" if x < 0 else "loss_increase" for x in [y for y in deltas if y != 0]])
        Out[17]: Counter({'loss_reduction': 260, 'loss_increase': 240})
      
      

      And

        with plt.style.context("fivethirtyeight"):
            plt.plot(deltas)
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-over-steps.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
      
      

      But wow, this next plot is fascinating!

        with plt.style.context("fivethirtyeight"):
            fig = plt.figure(figsize =(20, 9))
        
            plt.plot(deltas, linewidth=0.7)
            plt.title("Microbatch loss_after - loss_before")
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-over-steps.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()
      
      


    • So according to the above, yes, the micro-batch loss delta is ping-ponging back and forth and basically getting worse across the different micro-batch inputs. Wow, so glad I looked at this chronological kind of plot!
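
      One follow-up that would quantify "basically getting worse": the cumulative sum of those deltas is the total loss change attributable to the micro-batch updates themselves, so plotting it should slope upward if the updates are net-harmful. This reuses the deltas list and plotting pattern from above; the output filename is just a new made-up one.

        import numpy as np

        cumulative = np.cumsum(deltas)
        print("net loss change from all micro-batch updates:", cumulative[-1])

        with plt.style.context("fivethirtyeight"):
            fig = plt.figure(figsize=(20, 9))
            plt.plot(cumulative, linewidth=0.7)
            plt.title("Cumulative microbatch loss_after - loss_before")
            out_loc = f"{utc_ts(utc_now())}-micro-batch-loss-deltas-cumulative.png"
            print("saving to", out_loc)
            pylab.savefig(out_loc, bbox_inches="tight")
            pylab.close()
            plt.close()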