The BOOK
The Little Learner
A straight line to deep learning
Daniel P. Friedman and Anurag Mendhekar
Drawings by Qingqing Su
Forewords by Guy L. Steele Jr. and Peter Norvig
Published by The MIT Press
The Little Learner covers all the concepts necessary to develop an intuitive understanding of the workings of deep neural networks: tensors, extended operators, gradient descent algorithms, artificial neurons, dense networks, convolutional networks, residual networks and automatic differentiation.
The book aims to explain the workings of deep learning to readers who may not yet have acquired the mathematical skills necessary to dive into the subject.
Unlike other books in the field, this book makes very few assumptions about background knowledge: high-school mathematics and some familiarity with programming. It uses a layered approach, constructing advanced concepts from first principles with very small (“little”) programs that build on one another.
We introduce these ideas in the conversational question-and-answer style characteristic of the other books in the Little series. This style puts the reader at ease and lets ideas unfold frame by frame, rather than hitting the reader with a wall of text.
AUTHORS
Daniel P. Friedman
Professor of
Computer Science
Indiana University
Bloomington
Daniel P. Friedman is Professor of Computer Science in the School of Informatics, Computing, and Engineering at Indiana University and is the author of many books published by the MIT Press, including The Little Schemer and The Seasoned Schemer (with Matthias Felleisen); The Little Prover (with Carl Eastlund); The Reasoned Schemer (with William E. Byrd, Oleg Kiselyov, and Jason Hemann); and The Little Typer (with David Christiansen).
Anurag
Mendhekar
Co-founder and President
Paper Culture LLC
Los Altos
Anurag Mendhekar is co-founder and President of Paper Culture, where he focuses on developing artificial intelligence for creativity (also known as Generative AI). A serial entrepreneur, he started his career at the world-famous Xerox Palo Alto Research Center (PARC), where he was one of the inventors of Aspect-Oriented Programming. He earned his Ph.D. in Programming Languages from Indiana University, and his career since then has spanned a range of technologies, including distributed systems, image and video compression, and video distribution for VR.
The CODE
About the code
A Racket Package
The code for the book is available as a Racket package known as malt. In order to run and examine the code, follow the instructions in this section.
Transcribing code from the book
An important reminder
We use some special notations in the book to make our code fit nicely into "little" boxes.
Please refer to the notations in Transcribing to Scheme on page xxiii to convert the notation into Scheme/Racket code.
Step 1
Install Racket
Download and install Racket for your operating system here. The minimum version of Racket you will need is 8.2. Installing Racket will also install the command-line program raco. Ensure that the executable is on your path (set the PATH environment variable to include the Racket directory; this step depends on your OS).
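To confirm that Racket is installed and that the version is at least 8.2, a file like the following (not from the book) prints the version; version is part of Racket itself.
#lang racket

;; Not from the book: prints the installed Racket version so you can
;; confirm it is at least 8.2.
(displayln (version))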
Step 2
Get the malt package
There are two ways to install the malt package.
The first is to install it directly from the command line as a Racket package using raco. This installs the package in a default location. To follow the source code, you'll have to locate the package in your installation and read the code from there (you can also browse it directly in the Git repository).
raco pkg install malt
You should also run raco pkg update malt periodically to make sure you have the latest updates.
The second way to install malt is from the Git repository. This method also makes it easier to explore the package's source code. The following commands install it; refer to the repository's README.md for further details.
For MacOS and Linux:
git clone https://github.com/themetaschemer/malt.git
cd malt
make
make install
For Windows:
git clone https://github.com/themetaschemer/malt.git
cd malt
raco pkg install
You should also run git pull periodically in the repository to make sure you have the latest updates.
Step 3
Start Exploring
Racket can be run either from the command line or through DrRacket, an IDE that comes with the Racket installation. Follow the DrRacket tutorials if you have never used it before. To start working with malt, begin your file like this.
#lang racket
(require malt)
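As a quick sanity check that malt loaded correctly, you can try a small tensor expression. This is only a sketch: tensor and the extended + are introduced early in the book, and the printed form of the result depends on which tensor representation is in use.
#lang racket
(require malt)

;; A small sanity check (not from the book): element-wise addition of two
;; rank-1 tensors using malt's extended +.
;; Expected elements: 5.0 7.0 9.0 (printed form depends on the representation).
(+ (tensor 1.0 2.0 3.0) (tensor 4.0 5.0 6.0))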
Reference
Documentation for Malt
The detailed documentation for Malt in the standard Racket format is available here.
Chapter GUIDE
Chapter 0
Learning Scheme
Malt is not required to run the code from Chapter 0. Make sure you begin your file with
#lang racket
Chapters 1 through 12
And Interludes I-IV
The malt package is required for running the code in these chapters and most of the interludes, except for Interlude V. Make sure you begin your files with
#lang racket
(require malt)
Chapter 6 onwards
Tiny Variations
Starting in Chapter 6, you might begin to see tiny variations across repeated runs of the same code. This is normal and is due to the stochastic nature of sampling-obj wherever it is used.
Interlude V
Extended functions
Malt comes with three different implementations of tensors: learner, nested-tensors, and flat-tensors, which are the three ways tensors and automatic differentiation are developed in Appendices A and B. The default representation is learner.
Interlude V describes the semantics of extended functions and how they can be implemented, but it requires the learner representation of tensors. If you have switched implementations away from learner, you will need to switch back to learner temporarily. To experiment with this code (only if you have switched away from learner), start your file with these lines
#lang racket
(require malt/interlude-V)
This switches the representation of tensors for the remainder of this file, but tensors exported from this file may not work correctly if the rest of your code is built with a different representation. To switch representations back to learner completely, follow the instructions in the README.md of the Git repository or in the malt documentation.
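For example, here is a minimal sketch of the kind of experiment Interlude V makes possible, assuming ext1 and tensor are provided by malt/interlude-V as in the book.
#lang racket
(require malt/interlude-V)

;; A minimal sketch (assuming ext1 and tensor are available here, as in the
;; book): ext1 lifts a unary scalar function to operate element-wise over
;; tensors of any rank.
(define double
  (λ (x) (* 2.0 x)))

(define double-ext
  (ext1 double 0))

(double-ext (tensor 1.0 2.0 3.0))  ;; elements: 2.0 4.0 6.0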
Chapter 13 and Interlude VI
Iris
How to run the Iris example.
And a word of advice.
The Iris example in the book requires the Iris data set to be loaded. It can be loaded like this
#lang racket
(require malt)
(require malt/examples/iris)
This will make the following data set definitions available
iris-train-xs
iris-train-ys
iris-test-xs
iris-test-ys
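To get a feel for the data, you can peek at the tensor shapes. This is only a sketch, assuming shape (the tensor-shape function from the book) is exported by malt; run it after the requires shown above.
;; A quick peek at the data (a sketch): shape returns the shape of a tensor
;; as a list. The exact shapes depend on the data set split.
(shape iris-train-xs)
(shape iris-train-ys)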
The example can be run with
(run-iris)
This will run a grid-search to find a well-trained theta, and test its accuracy against iris-test-xs and iris-test-ys. All the code associated with the example is located in the malt/examples/iris sub-directory and can also be directly examined here.
A word of advice. All neural networks are susceptible to variation depending upon how they are initialized. For larger networks, this is usually not a problem because the training process evens out these variations.
The neural network defined for the Iris example, however, is very small. This makes it susceptible to much larger variations because of the randomness in its initialization. What is important, however, is that we arrive at a trained theta that passes our accuracy thresholds.
Consequently, running grid-search with iris-classifier to find a trained theta can produce a result different from the one described in the book.
Therefore, with the code for the book, we have included the initial theta that was used when the book went to print. It can be examined like this
tll-iris-initial-theta
The example on page 266 can then be run like this
(define iris-theta
  (with-hypers ((revs 2000)
                (alpha 0.0002)
                (batch-size 8))
    (naked-gradient-descent
      (sampling-obj
        (l2-loss iris-classifier)
        iris-train-xs iris-train-ys)
      tll-iris-initial-theta)))
The trained theta generated will also have some variation because of the stochastic nature of naked-gradient-descent using sampling-obj. This means that the accuracy from one trained iris-theta to another varies somewhat as well.
Readers are encouraged to experiment with grid-search as described in Interlude VI to find the right combination of hyperparameters for accuracy that is as high as possible for the iris-test-xs and iris-test-ys.
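One way to measure the accuracy of a trained theta yourself is sketched below; it assumes the model and accuracy helpers used in the book's examples (accuracy is also used with the Morse example later on this page).
;; A sketch (assumes model and accuracy, as in the book's examples):
;; wrap the trained theta in the classifier and measure test-set accuracy.
(accuracy (model iris-classifier iris-theta)
          iris-test-xs iris-test-ys)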
Chapter 15
Morse
How to run the Morse example.
Important: the Morse example in the book must be run with a faster implementation of tensors. Please follow the instructions in the repository's README.md or the malt documentation to switch implementations to flat-tensors or nested-tensors.
The Morse example also requires its own data set to be loaded, which is done as follows
#lang racket
(require malt)
(require malt/examples/morse)
This will load the data set and provide the following training set definitions
morse-train-xs
morse-train-ys
and the following test set definitions.
morse-test-xs
morse-test-ys
The book describes two different networks: a fully convolutional network (morse-fcn) and a residual network (morse-residual). Code for this example can be viewed here.
To train and test morse-fcn
(define fcn-model
  (train-morse morse-fcn))

(accuracy fcn-model
          morse-test-xs morse-test-ys)
This will display the accuracy of the trained model against the test set.
To train and test morse-residual
(define residual-model
  (train-morse morse-residual))

(accuracy residual-model
          morse-test-xs morse-test-ys)
This will similarly display the accuracy of the trained model against the test set.
The code in this example is also set up to display the progress of gradient descent by printing a moving average of the network's loss every 20 revisions. For example,
(16080 0.072560) [Memory: 139334768][Window size 6]
This says that the average loss across the last 6 batches at the 16080th revision was 0.07256, while the system consumed about 139 MB of memory. The count of revisions is cumulative, but it can be reset by
(log-malt-reset)
The Morse examples are currently set up to run 20,000 revisions during training.
Appendix A
Automatic Differentiation (AD)
The malt package is not required to implement the code in Appendix A. You can use the code for the learner implementation as a reference. Begin your files with
#lang racket
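If you want a tiny taste of the idea before diving in, here is a toy forward-mode sketch of differentiation with dual numbers in plain Racket. It is only an illustration: Appendix A develops its own, different dual representation, and the names below (dual, tangent, derivative-at) are hypothetical and come neither from the book nor from malt.
#lang racket

;; A toy forward-mode AD sketch, for illustration only.
(struct dual (real tangent) #:transparent)

(define (d+ a b)
  (dual (+ (dual-real a) (dual-real b))
        (+ (dual-tangent a) (dual-tangent b))))

(define (d* a b)
  ;; product rule: (fg)' = f'g + fg'
  (dual (* (dual-real a) (dual-real b))
        (+ (* (dual-real a) (dual-tangent b))
           (* (dual-tangent a) (dual-real b)))))

;; Derivative of f at x: seed the tangent with 1.0 and read it back out.
(define (derivative-at f x)
  (dual-tangent (f (dual x 1.0))))

(derivative-at (λ (x) (d* x x)) 3.0)  ;; ⇒ 6.0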
Appendix B
More efficient tensors and AD
Only a portion of the malt package is required to implement the code in Appendix B: primarily the extended non-differentiable operators for nested tensors, and duals. These are provided in the module malt/appendix-B, so begin your file with
#lang racket
(require malt/appendix-B)
For flat-tensors, refer to the code here.
follow US
Stay up-to-date with the latest
Errata
Things that escaped our attention in the 1st printing.
Here is a list of known errors. Should you find more, let us know using any of the methods in the Follow section.
Chapter 2, Pg 37, Frame 26: "Given the same tensor from frame 23" should be "Given a tensor similar to the one in frame 23".
Chapter 4, Pg 76, Frame 12: "at 0.6623" should be "at 0.6263".
Interlude IV, Pg 156, Frame 7: The smoothed sequence should be
5.03 6.8 6.55 6.16 5.73 5.37 4.89
Chapter 9, Pg 174, Frame 44: The starting learning rate should be 0.01, not 0.001.
Interlude V, Pg 190, Frame 44: The definition of *-2-1 provided has some edge cases that produce incorrect results in the learner and nested-tensors representations. To fully address those cases, the definition should be replaced with the following:
(define *-1-1
  (ext2 * 1 1))
(define *-2-1
  (ext2 *-1-1 2 1))
Chapter 11, Pg 229, Frame 57: The definition of k-relu should be replaced with:
(define k-relu
  (λ (k)
    (λ (t)
      (λ (θ)
        (cond
          ((zero? k) t)
          (else (((k-relu (sub1 k))
                  ((relu t) θ))
                 θ2↓)))))))
Chapter 15, Pg 339, Frame 71: The definition of skip should be as follows:
(define skip
  (λ (f j)
    (λ (t)
      (λ (θ)
        (+ ((f t) θ)
           (correlate θj t))))))
Appendix A, Pg 373, Frame 77: The definition of expt-0-0 should be:
(define expt-0-0
  (prim2 expt
    (λ (ra rb z)
      (values (* z (* rb (expt ra (- rb 1))))
              (* z (* (expt ra rb) (log ra)))))))
Appendix B, Pg 396, Frame 48: The definition of fill-gt-gu should be:
(define fill-gt-gu
  (λ (gt gu init i)
    (let-values (((gti gui) (init i)))
      (vector-set! gt i gti)
      (vector-set! gu i gui)
      (cond
        ((zero? i) (values gt gu))
        (else
         (fill-gt-gu gt gu init (sub1 i)))))))
Kindle Edition, Chapter 3, Frame 44: The left-hand side same-as chart is rendered incorrectly.