- ok [[my backprop SGD from scratch 2022-Aug]]
- looking over results from last time, it is indeed strange how the micro-batch loss was going back and forth, and that the plot of my change in loss is eventually trending upward.
deltas = [x["loss_after"] - x["loss_before"] for x in metrics["micro_batch_updates"]]
although initially some of the values were negative, as well.
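Maybe worth quantifying that trend rather than just eyeballing the plot; a small sketch, reusing the same metrics["micro_batch_updates"] structure as above:

import numpy as np

deltas = np.array(
    [x["loss_after"] - x["loss_before"] for x in metrics["micro_batch_updates"]]
)
print("fraction of updates that made the loss worse:", (deltas > 0).mean())

# compare early vs late training to see whether the trend really is upward
quarter = max(1, len(deltas) // 4)
print("mean delta, first quarter:", deltas[:quarter].mean())
print("mean delta, last quarter: ", deltas[-quarter:].mean())

If the last-quarter mean is clearly above zero while the first-quarter mean is near or below zero, then the updates really are getting worse over time.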
- But I wonder, is something terribly wrong if this number ever goes up at all? I think maybe yes, unless it just indicates the learning rate is still too high? I am using 0.01, but maybe that is still too high when updating on a single example at a time.
- Ok, let me try an even smaller learning rate,
import network as n
import dataset
import plot
import runner
import ipdb
import matplotlib.pyplot as plt
import pylab
from collections import Counter
from utils import utc_now, utc_ts

data = dataset.build_dataset_inside_outside_circle(0.5)
parameters = {"learning_rate": 0.001,
              "steps": 1000,
              "log_loss_every_k_steps": 10}
model, artifacts, metrics = runner.train_and_analysis(data, parameters)
outer: 100%|█████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 83.82it/s]
saving to 2022-10-12T175402.png
2022-10-12T175402.png
2022-10-12T175403-weights.png
2022-10-12T175404-hist.png
saving to 2022-10-12T175404-scatter.png
2022-10-12T175404-scatter.png
saving to 2022-10-12T175404-micro-batch-loss-deltas-over-steps.png
2022-10-12T175404-micro-batch-loss-deltas-over-steps.png
- 14:01 well, the worsening loss trend is still there; the only thing that seems to have changed is that the scale of the loss changes is proportionally smaller, following the reduction of the learning rate from 0.01 to 0.001, I suppose.
- So yea, wondering whether I ought to next just look for more bugs, or consider increasing the batch size from one to something larger (a rough sketch of why that should help is below).
- Oh yea, and in any case the fact that the train loss is slightly worse than the validation loss is another red flag. And of course the loss in both cases is going up, so yea, still fundamental problems.
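For reference, the reason I'd expect a bigger batch to help: averaging per-example gradients before stepping cuts the noise in each update. A toy illustration with made-up gradient values (this is not the actual runner code):

import numpy as np

rng = np.random.default_rng(0)
# pretend per-example gradients for a 6-parameter model: each is the same
# "true" gradient (0.5 everywhere) plus a lot of noise
per_example_grads = [rng.normal(loc=0.5, scale=2.0, size=6) for _ in range(8)]

single_example_step = per_example_grads[0]             # what the current loop steps on
mini_batch_step = np.mean(per_example_grads, axis=0)   # averaged over the batch of 8

print("single-example grad:", np.round(single_example_step, 2))
print("mini-batch grad:    ", np.round(mini_batch_step, 2))

The averaged vector is much closer to the underlying mean (0.5 in every coordinate) than any single noisy sample, so individual updates are less likely to make the loss worse.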
- Matt Mazur article; will look at it again.
- 14:46 going to do a super simple test of the feed forward now. Unclear what the problem is; maybe there is some fundamental matrix multiplication bug?
- ok, starting with a fresh network with random weights, going to follow one or two inputs through to the end,
import network as n

parameters = {"learning_rate": 0.01}
model = n.initialize_model(parameters)

def feed_forward_manually(model, x):
    x1, x2 = x[0], x[1]
    w1, w2, w3 = model.layers[0].weights[0]
    w4, w5, w6 = model.layers[0].weights[1]
    h1 = n.logit_to_prob(x1*w1 + x2*w4 + 1)
    h2 = n.logit_to_prob(x1*w2 + x2*w5 + 1)
    h3 = n.logit_to_prob(x1*w3 + x2*w6 + 1)

    w7, w8 = model.layers[1].weights[0]
    w9, w10 = model.layers[1].weights[1]
    w11, w12 = model.layers[1].weights[2]
    h4 = n.logit_to_prob(h1*w7 + h2*w9 + h3*w11 + 1)
    h5 = n.logit_to_prob(h1*w8 + h1*w10 + h3*w12 + 1)

    w13 = model.layers[2].weights[0][0]
    w14 = model.layers[2].weights[1][0]
    y_prob = n.logit_to_prob(h4*w13 + h5*w14 + 1)
    return y_prob

x = [1, 2]
y_prob_manually = feed_forward_manually(model, x)
y_prob_mat_mul = n.feed_forward(x, model.layers)
print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
# y_prob_manually 0.7103884357305136 y_prob_mat_mul 0.47438553403530753
- 15:57 ok, well, is this a bug? Not sure why these are different. But maybe this is a good test then?!
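A sketch of what that test could look like (assuming the feed_forward_manually defined above is in scope and returns just the final probability); right now it would fail on the first input, which is exactly the point:

import numpy as np
import network as n

def test_feed_forward_matches_manual(n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    model = n.initialize_model({"learning_rate": 0.01})
    for _ in range(n_trials):
        x = rng.uniform(-10, 10, size=2).tolist()
        manual = feed_forward_manually(model, x)      # the hand-written version above
        matmul = n.feed_forward(x, model.layers)
        assert np.isclose(manual, matmul), (x, manual, matmul)

test_feed_forward_matches_manual()   # currently raises, since the two versions disagree

Once the discrepancy is sorted out, this could stay around as a regression test for the matmul feed forward.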
- And in any case, perhaps I should be randomizing the bias term and doing updates on it as well. But yea, first I should make sure this feed forward works as expected.
- And in addition I'm seeing something kind of weird: for some hand-selected inputs the outputs seem to be pretty tightly constrained. Not much movement here; that's not ideal,
In [32]: x = [-1, 20]
    ...: y_prob_manually = feed_forward_manually(model, x)
    ...: y_prob_mat_mul = n.feed_forward(x, model.layers)
    ...: print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
y_prob_manually 0.711842713079452 y_prob_mat_mul 0.4761151735416374

In [33]: x = [10, 20]
    ...: y_prob_manually = feed_forward_manually(model, x)
    ...: y_prob_mat_mul = n.feed_forward(x, model.layers)
    ...: print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
y_prob_manually 0.7105989141221917 y_prob_mat_mul 0.47474962375739577

In [34]: x = [10, 200]
    ...: y_prob_manually = feed_forward_manually(model, x)
    ...: y_prob_mat_mul = n.feed_forward(x, model.layers)
    ...: print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
y_prob_manually 0.711961055211584 y_prob_mat_mul 0.4762497657849592

In [35]: x = [0, 0]
    ...: y_prob_manually = feed_forward_manually(model, x)
    ...: y_prob_mat_mul = n.feed_forward(x, model.layers)
    ...: print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
y_prob_manually 0.7105745880018329 y_prob_mat_mul 0.4745660467842152

In [36]: x = [-5, -5]
    ...: y_prob_manually = feed_forward_manually(model, x)
    ...: y_prob_mat_mul = n.feed_forward(x, model.layers)
    ...: print("y_prob_manually", y_prob_manually, "y_prob_mat_mul", y_prob_mat_mul)
y_prob_manually 0.7111368385474176 y_prob_mat_mul 0.4751433263136077
- maybe I can pinpoint which layer has the bug?
from pprint import pprint
import test_feed_forward

x = [-5, -5]
frozen = test_feed_forward.feed_forward_manually(model, x)
y_prob_mat_mul = n.feed_forward(x, model.layers)
pprint([
    ["--", "manually", "matmul"],
    ["h1", frozen["h1"], model.layers[0].nodes["h1"]],
    ["h2", frozen["h2"], model.layers[0].nodes["h2"]],
    ["h3", frozen["h3"], model.layers[0].nodes["h3"]],
    ["h4", frozen["h4"], model.layers[1].nodes["h4"]],
    ["h5", frozen["h5"], model.layers[1].nodes["h5"]],
    ["y_prob", frozen["y_prob"], y_prob_mat_mul],
])
[['--', 'manually', 'matmul'],
 ['h1', 0.44128214463701015, 0.44128214463701015],
 ['h2', 0.9894985068325902, 0.9894985068325902],
 ['h3', 0.7973686345809591, 0.7973686345809591],
 ['h4', 0.7643256444514099, 0.7643256444514099],
 ['h5', 0.7752070858908596, 0.7604738710471179],
 ['y_prob', 0.7111368385474176, 0.4751433263136077]]
ok, so h1, h2, h3, h4 are matching, and then h5 is where the problem starts, hmm.
- 16:24 ok think I found a small bug ,
from pprint import pprint
import test_feed_forward

x = [-5, -5]
frozen = test_feed_forward.feed_forward_manually(model, x)
y_prob_mat_mul = n.feed_forward(x, model.layers)
pprint([
    ["--", "manually", "matmul"],
    ["h1", frozen["h1"], model.layers[0].nodes["h1"]],
    ["h2", frozen["h2"], model.layers[0].nodes["h2"]],
    ["h3", frozen["h3"], model.layers[0].nodes["h3"]],
    ["h4", frozen["h4"], model.layers[1].nodes["h4"]],
    ["h5", frozen["h5"], model.layers[1].nodes["h5"]],
    ["y_prob", frozen["y_prob"], y_prob_mat_mul],
])
[['--', 'manually', 'matmul'],
 ['h1', 0.44128214463701015, 0.44128214463701015],
 ['h2', 0.9894985068325902, 0.9894985068325902],
 ['h3', 0.7973686345809591, 0.7973686345809591],
 ['h4', 0.7643256444514099, 0.7643256444514099],
 ['h5', 0.7604738710471179, 0.7604738710471179],
 ['y_prob', 0.7110504493887152, 0.4751433263136077]]
So weird. Ok, now h5 matches, just not y_prob. That bug was in my test func.
- hmm ok, reran one more time; so I had an extra bias term in my test func, added in the final logit, that is not in the main matmul feed forward func.
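So presumably the fix is just dropping that hard-coded +1 from the last line of the test func (or using the output layer's actual bias, which appears to be 0 from init), so it matches the matmul version:

# corrected final line of feed_forward_manually -- no extra +1 on the output logit
y_prob = n.logit_to_prob(h4*w13 + h5*w14)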
- 16:44 ok, well then there's no bug in the feed forward func, but it is weird how tight the outputs are. Something tells me this is actually related to the hard-coded bias values of 1?
- Let me loosen up the bias, maybe that helps.
Ok so before,
data = [[0, 0], [4, 5], [-4, 5], [-5, -5], [5, -5], [-20, -20], [100, 100]]
pprint([[x, n.feed_forward(x, model.layers)] for x in data])

[[[0, 0], 0.4745660467842152],
 [[4, 5], 0.4738876007756717],
 [[-4, 5], 0.4762211475351283],
 [[-5, -5], 0.4751433263136077],
 [[5, -5], 0.4737412140572382],
 [[-20, -20], 0.4754772076192018],
 [[100, 100], 0.4748013350557756]]
And after,
import numpy as np

model.layers[0] = model.layers[0]._replace(bias=np.array([0.1]))
model.layers[1] = model.layers[1]._replace(bias=np.array([0.1]))
data = [[0, 0], [4, 5], [-4, 5], [-5, -5], [5, -5], [-20, -20], [100, 100]]
pprint([[x, n.feed_forward(x, model.layers)] for x in data])
[[[0, 0], 0.4813065143876082],
 [[4, 5], 0.4804879738767632],
 [[-4, 5], 0.48321160815195635],
 [[-5, -5], 0.4823729185086716],
 [[5, -5], 0.4793507289434747],
 [[-20, -20], 0.4822354979795412],
 [[100, 100], 0.4810180811163541]]
hmm doesn't seem to have helped. Let me go lower,
model.layers[0] = model.layers[0]._replace(bias=np.array([0.01]))
model.layers[1] = model.layers[1]._replace(bias=np.array([0.01]))
print("biases, ", model.layers[0].bias, model.layers[1].bias, model.layers[2].bias)
# biases,  [0.01] [0.01] [0]
data = [[0, 0], [4, 5], [-4, 5], [-5, -5], [5, -5], [-20, -20], [100, 100]]
pprint([[x, n.feed_forward(x, model.layers)] for x in data])
[[[0, 0], 0.4820947777588216],
 [[4, 5], 0.48128620914361286],
 [[-4, 5], 0.4839510956396274],
 [[-5, -5], 0.4831921147269756],
 [[5, -5], 0.4800346864013735],
 [[-20, -20], 0.48300320046986295],
 [[100, 100], 0.48173375233047927]]
- 17:03 ok, so weird, not even adjustments to the bias helped. Just double checking with the manual feed forward too,
from pprint import pprint
import test_feed_forward

x = [-5, -5]
frozen = test_feed_forward.feed_forward_manually(model, x)
y_prob_mat_mul = n.feed_forward(x, model.layers)
pprint([
    ["--", "manually", "matmul"],
    ["h1", frozen["h1"], model.layers[0].nodes["h1"]],
    ["h2", frozen["h2"], model.layers[0].nodes["h2"]],
    ["h3", frozen["h3"], model.layers[0].nodes["h3"]],
    ["h4", frozen["h4"], model.layers[1].nodes["h4"]],
    ["h5", frozen["h5"], model.layers[1].nodes["h5"]],
    ["y_prob", frozen["y_prob"], y_prob_mat_mul],
])
[['--', 'manually', 'matmul'],
 ['h1', 0.22688927424734276, 0.22688927424734276],
 ['h2', 0.9722312068731663, 0.9722312068731663],
 ['h3', 0.5938559073859204, 0.5938559073859204],
 ['h4', 0.51632734891332, 0.51632734891332],
 ['h5', 0.5124838141374718, 0.5124838141374718],
 ['y_prob', 0.4831921147269756, 0.4831921147269756]]
Ok yea, so it looks like there is no bug, and reducing the bias has not diminished how frozen the outputs seem to be.
- 17:11 so yea, for now it feels like it is good that I verified the feed forward func does what it is supposed to, but it is super weird that the network's outputs are so tightly clustered. Super weird.
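One way to make "tightly clustered" concrete: h4 and h5 are sigmoid outputs, so they are stuck in (0, 1), which means the final logit can only range over an interval set by the two last-layer weights plus the output bias. A quick sketch against the current model object (assuming the matmul feed forward adds each layer's bias attribute, and that bias indexes like the weights do):

import network as n

w13 = model.layers[2].weights[0][0]
w14 = model.layers[2].weights[1][0]
bias = model.layers[2].bias[0]   # 0 here; only the hidden-layer biases were changed above

# h4 and h5 are sigmoid outputs, so each lies in (0, 1); the logit is linear in
# them, so its extremes are at the corners of that unit square
corner_logits = [h4 * w13 + h5 * w14 + bias for h4 in (0, 1) for h5 in (0, 1)]
print("final logit range:", min(corner_logits), "to", max(corner_logits))
print("y_prob range:", n.logit_to_prob(min(corner_logits)), "to",
      n.logit_to_prob(max(corner_logits)))

If that interval is narrow and centered near zero, y_prob is mathematically pinned close to 0.5 no matter what the inputs are, which would match what I'm seeing.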
- Maybe I should not have activation functions on the inner layers? Nah I don't think that's the problem.
- Makes me wonder whether there is something about this particular multi-layer network architecture that is behaving weirdly? Maybe I should try different architectures? Do they have similar properties?