PyTorch Tutorial from Basic to Advanced Level: A NumPy replacement and Deep Learning Framework that provides maximum flexibility with speed

45 min read · Apr 11, 2020

Basics of PyTorch, Tensors, Variable, CPU vs GPU, Computational Graph: NumPy vs PyTorch, Module, CUDA Tensors, Autograd, Converting NumPy Array to Torch Tensor, Data Parallelism using GPUs, Mathematical Operations, Matrix Initialization and Matrix Operations, Optim Module, nn Module, Deep Learning Algorithm: A perceptron, Multiclass classifier, Backpropagation in PyTorch, MultiLayer Perceptron, Convolutional Layer, Transposed Convolutional Layer, Quasi Recurrent Neural Network, Auto-Encoder Neural Network, Variational Auto-Encoder, Deconvolutional Neural Network, AlexNet, Time delay neural network, Deep Convolution GAN, GPU Neural Network Training, Data Parallelism Using CUDA

A. Introduction

In this blog, I explain the open-source deep learning library PyTorch, together with implementations of deep learning algorithms. PyTorch is developed by Facebook's artificial-intelligence research group, and Uber's Pyro software for probabilistic programming is built on top of it. PyTorch keeps the GPU-based hardware acceleration and the extensibility features of the Lua-based Torch. Its major features are an easy interface, Python usage, and computational graphs.

This library has three levels of abstraction:

i. Tensor − Imperative n-dimensional array which runs on GPU.

ii. Variable − Node in the computational graph. This stores data and gradient.

iii. Module − Neural network layer which will store state or learnable weights.

PyTorch, and most of the other deep learning frameworks, can be used for two different things:

i. Replacing NumPy-like operations with GPU-accelerated operations

ii. Building deep neural networks

PyTorch was primarily built for research; it is not recommended for production use in scenarios where latency requirements are very strict.

B. PyTorch Tensors

Tensors are the key components of PyTorch. PyTorch is an optimized tensor manipulation library that offers an array of packages for deep learning. At the core of the library is the tensor, which is a mathematical object holding some multidimensional data. A tensor of order zero is just a number, or a scalar. A tensor of order one (1st-order tensor) is an array of numbers, or a vector. Similarly, a 2nd-order tensor is an array of vectors, or a matrix.

In mathematics, a rectangular array of numbers is called a matrix. In the NumPy library, such arrays are called ndarray. In PyTorch, they are known as Tensor. A tensor is an n-dimensional data container. For example, in PyTorch a 1d-tensor is a vector, a 2d-tensor is a matrix, a 3d-tensor is a cube, and a 4d-tensor is a vector of cubes. Torch provides tensor computation with strong GPU acceleration. It is essential to get familiar with the tensor data structure to work with PyTorch; it is a fundamental prerequisite for implementing neural networks. In deep learning the tensor is the key building block, which is why it is discussed so often. It even appears in the name of Google's main machine learning library, TensorFlow.

CUDA Tensors

To use a GPU, you need to first allocate the tensor on the GPU’s memory. So far, we have been allocating our tensors to the CPU memory. When doing linear algebra operations, it might make sense to utilize a GPU, if you have one. Access to the GPUs is via a specialized API called CUDA. The CUDA API was created by NVIDIA and is limited to use on only NVIDIA GPUs.

[Figure: tensor core mixed-precision accumulator]

PyTorch offers CUDA tensor objects that are indistinguishable in use from the regular CPU bound tensors except for the way they are allocated internally. PyTorch makes it very easy to create these CUDA tensors, transferring the tensor from the CPU to the GPU while maintaining its underlying type.
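As a minimal sketch of the paragraph above, the snippet below moves a CPU tensor to the GPU and checks that its type is kept. The variable names are illustrative, and the code falls back to the CPU when no CUDA device is visible, so it is an assumption-light example rather than a recommended setup.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cpu_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # allocated in CPU memory
dev_tensor = cpu_tensor.to(device)                     # moved to `device` (no-op if device is the CPU)

# The dtype is preserved across the transfer; only the device changes.
print(cpu_tensor.dtype, cpu_tensor.device)
print(dev_tensor.dtype, dev_tensor.device)

# Tensors can also be allocated directly on the target device.
direct = torch.zeros(2, 2, device=device)

# Both operands live on the same device, so the addition is valid.
result = (dev_tensor + direct).cpu()
print(result)
```

Writing code against a `device` variable instead of calling `.cuda()` directly keeps the same script usable on machines without a GPU.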

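C. PyTorch Variable

A PyTorch Variable is a wrapper around a PyTorch Tensor, and represents a node in a computational graph. If x is a Variable then x.data is a Tensor giving its value, and x.grad is another Variable holding the gradient of x with respect to some scalar value. Since PyTorch 0.4, Variable has been merged into Tensor: a tensor created with requires_grad=True plays the same role, and the Variable wrapper still works but simply returns a tensor.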

Autograd

Autograd is the PyTorch package that provides automatic differentiation for all operations on Tensors. Backpropagation starts from a variable, and the gradient of each variable x is then available through its grad attribute, x.grad. autograd.Variable is the central class of the package. It wraps a Tensor and supports nearly all of the operations defined on it. Once you finish your computation, you can call .backward() and have all the gradients computed automatically. You can access the raw tensor through the .data attribute, while the gradient w.r.t. this variable is accumulated into .grad.

The third attribute a Variable holds is a grad_fn, a Function object which created the variable.

PyTorch Variables have the same API as PyTorch tensors: any operation you can do on a Tensor you can also do on a Variable; the difference is that autograd allows you to automatically compute gradients.
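The three attributes described above (the value, .grad and .grad_fn) can be inspected on a plain tensor. This is a small sketch that uses the tensor API of PyTorch 0.4 and later instead of the Variable wrapper; the values are made up for illustration.

```python
import torch

# A leaf tensor that autograd should track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()      # scalar result: y = x1^2 + x2^2 + x3^2
y.backward()            # fills x.grad with dy/dx = 2 * x

print(x.grad)           # gradient accumulated on the leaf
print(x.grad_fn)        # None: a leaf is created by the user, not by an operation
print(y.grad_fn)        # the Function object that produced y
```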

Computation Graphs

While training a neural network, we need to compute gradients of the loss function with respect to every weight and bias, and then update these weights using gradient descent. PyTorch creates a dynamic computation graph: until the forward function of a Variable is called, there is no node for the Variable (its grad_fn) in the graph. The graph is built as the forward functions of the Variables are invoked.

The PyTorch code that creates a computation graph is shown below.

Code

from torch import FloatTensor
from torch.autograd import Variable

# Define the leaf nodes
a = Variable(FloatTensor([4]))
weights = [Variable(FloatTensor([i]), requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

# backpropagate from L to the leaf nodes
L.backward()

for index, weight in enumerate(weights, start=1):
    gradient, *_ = weight.grad.data
    print(f"Gradient of L w.r.t. w{index}: {gradient}")
#output:
Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0

If any operand of an operation has requires_grad set to True, the result will also have requires_grad set to True.

Once requires_grad = True is set, PyTorch starts tracking the operations and stores the gradient function (grad_fn) at each step.
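The propagation rule above can be checked directly. The following is a minimal sketch with illustrative tensor names; it also shows torch.no_grad(), which switches tracking off inside its block.

```python
import torch

a = torch.ones(3)                       # requires_grad defaults to False
w = torch.ones(3, requires_grad=True)   # tracked leaf

b = a * 2        # no tracked operand, so nothing is recorded
c = a * w        # one tracked operand is enough to track the result
d = c.sum()

print(b.requires_grad, b.grad_fn)
print(c.requires_grad, c.grad_fn)
print(d.requires_grad, d.grad_fn)

# Inside torch.no_grad() tracking is disabled even for tracked operands.
with torch.no_grad():
    e = a * w
print(e.requires_grad)
```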

Basic Example

#import packages
import torch
import numpy as np
import torch.nn as nn
torch.__version__
#output: '1.4.0'
#Numpy example
'''Creating numpy variable'''
num1 = np.array([4, 5, 10])
num2 = np.array([12,10, 9])
#sum
num = num1 + num2
num
#output: array([16, 15, 19])
#Torch example
'''Creating a torch tensors'''
num1 = torch.tensor([4, 5, 10])
num2 = torch.tensor([12,10, 9])
#sum
num = num1 + num2
num
#output: tensor([16, 15, 19])
#Creating a torch tensor
val = torch.tensor([4,10,21,2])
val
#output: tensor([ 4, 10, 21,  2])
#converting into numpy array
val.numpy()
#output: array([ 4, 10, 21,  2], dtype=int64)
#Create numpy variable
val= np.array([[10,12,15],[10,20,14]])
val
#output: array([[10, 12, 15],
       [10, 20, 14]])
#Conversion into torch
val=torch.from_numpy(val)
val
#output: tensor([[10, 12, 15],
        [10, 20, 14]], dtype=torch.int32)

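Note:

NumPy arrays and PyTorch CPU tensors store data in memory in the same way, so torch.from_numpy() and Tensor.numpy() share the underlying memory instead of copying it.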
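A short sketch of what the shared memory layout means in practice: a tensor made with torch.from_numpy sees later in-place changes to the array, while torch.tensor makes an independent copy. The array values are arbitrary.

```python
import numpy as np
import torch

arr = np.array([1, 2, 3])
shared = torch.from_numpy(arr)   # wraps the same buffer, no copy
copied = torch.tensor(arr)       # copies the data

arr[0] = 100                     # modify the NumPy array in place

print(shared)   # reflects the change made through arr
print(copied)   # still holds the original values

# The reverse direction shares memory as well (CPU tensors only).
t = torch.zeros(3)
view = t.numpy()
t.add_(1)       # in-place update of the tensor
print(view)     # the NumPy view sees the update
```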

Tensor Introduction

A tensor is often thought of as a generalized matrix. That is, it could be a 1-D matrix (a vector is actually such a tensor), a 3-D matrix (something like a cube of numbers), even a 0-D matrix (a single number), or a higher-dimensional structure that is harder to visualize. The number of dimensions of the tensor is called its rank.

Size, offset, strides

In order to index into storage, tensors rely on a few pieces of information, which, together with their storage, unequivocally define them: size, storage offset, and strides. The size (or shape, in NumPy parlance) is a tuple indicating how many elements across each dimension the tensor represents. The storage offset is the index in the storage corresponding to the first element in the tensor. Stride is the number of elements in the storage that need to be skipped over to obtain the next element along each dimension.
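To make size, storage offset and strides concrete, here is a small sketch on a 3x4 tensor; the shape and slices are arbitrary choices for illustration.

```python
import torch

t = torch.arange(12).view(3, 4)   # 3 rows, 4 columns over 12 contiguous elements
print(t.size(), t.storage_offset(), t.stride())

# Slicing gives a new tensor over the same storage with a different offset:
# the first element of sub is t[1, 2], i.e. storage index 1*4 + 2.
sub = t[1:, 2:]
print(sub.size(), sub.storage_offset(), sub.stride())

# Transposing swaps the strides; no data is moved.
tt = t.t()
print(tt.size(), tt.stride(), tt.is_contiguous())
```

Because slices and transposes only change this metadata, they are cheap, but writing to them in place also changes the original tensor.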

Example

#Create a tensor
val=torch.tensor([[3,4,5,6], [4,6,7,8]])
val
#output: tensor([[3, 4, 5, 6],
        [4, 6, 7, 8]])
#data type
val.dtype
#output: torch.int64
#create a float tensor
val=torch.FloatTensor([[6,12,14,10], [10,11,12,16]])
val
#output: tensor([[ 6., 12., 14., 10.],
        [10., 11., 12., 16.]])
#Data Types
val.dtype
#output: torch.float32

Float tensor

#Creating tensor for boolean value
val=torch.tensor([[6,12,14,10], [10,11,12,16]], dtype=torch.bool)
val
#output: tensor([[True, True, True, True],
        [True, True, True, True]])
#create a matrix with random numbers 
torch.rand(6,4)
#output:
tensor([[0.1148, 0.2817, 0.0763, 0.5922],
        [0.3643, 0.0984, 0.5095, 0.9770],
                     ...              ]])
#full ones
torch.ones(4,6)
#output: tensor([[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]])
#Create Tensor and find their sum 
var = torch.tensor([[6,12,14,10], [10,11,12,16]])
print(var)
print(f'sum: {var.sum()}')
#output: tensor([[ 6, 12, 14, 10],
        [10, 11, 12, 16]])
sum: 91
#Transpose 
var.t()
#output:
tensor([[ 6, 10],
        [12, 11],
        [14, 12],
        [10, 16]])
# Transpose (via permute)
var.permute(-1,0)
#output:
tensor([[ 6, 10],
        [12, 11],
        [14, 12],
        [10, 16]])
# Reshape via view
var.view(2,4)
#output:
tensor([[ 6, 12, 14, 10],
        [10, 11, 12, 16]])
# View again...
var.view(8,1)
#output:
tensor([[ 6],
        [12],
          . ,
          . ])
# Slicing
t = torch.Tensor([[10,23,38,22], [22,32,11,23], [19,29,49,17]])
# Every row, only the last column
print(t[:, -1])
#output: tensor([22., 23., 17.])
# First 2 rows, all columns
print(t[:2, :])
#output: tensor([[10., 23., 38., 22.],
        [22., 32., 11., 23.]])
# Lower right most corner
print(t[-1:, -1:])
#output: tensor([[17.]])
#Size
t.size()
#output: torch.Size([3, 4])
#Creating a new tensor
t2 = torch.tensor([[16,13,24,23], [23,36,78,25],[26,34,68,12]])
#sum
t3 =t.add(t2)
t3
#output:tensor([[ 26.,  36.,  62.,  45.],
        [ 45.,  68.,  89.,  48.],
        [ 45.,  63., 117.,  29.]])
#In-place
t.add_(t2)
t
#output:tensor([[ 26.,  36.,  62.,  45.],
        [ 45.,  68.,  89.,  48.],
        [ 45.,  63., 117.,  29.]])

More tensor operations

#cross product (dim=1: each row is treated as a 3-D vector)
val1 = torch.randn(4,3)
val2 = torch.randn(4,3)
val1.cross(val2, dim=1)
#output:
tensor([[-2.3493,  0.5739, -4.0265],
        [-0.9982,  0.7558, -0.2420],
        [ 0.3225, -1.4824, -0.8597],
        [-0.0962, -1.5052,  0.0313]])
#matrix product
var = (torch.Tensor([[10, 14], [26, 30]]).mm(torch.Tensor([[10], [20]])))
var
#output:tensor([[380.],
        [860.]])
# Elementwise multiplication
val= torch.Tensor([[20,12], [10,21]])
val.mul(val)
#output: tensor([[400., 144.],
        [100., 441.]])

Running on GPU

GPUs have many more cores than CPUs, so for parallel computation over data GPUs perform far better than CPUs, even though a GPU has a lower clock speed and lacks several core-management features of a CPU.
Running a Python script on a GPU can therefore be considerably faster than on a CPU. Note, however, that to process a data set on the GPU the data must first be transferred to the GPU's memory, which takes additional time; for a small data set the CPU may therefore outperform the GPU.

GPU installation guide

To install CUDA, cuDNN and GPU support on Windows 10, see the article "Installing Tensorflow with CUDA, cuDNN and GPU support on Windows 10" (Part 2 of the series "Pimp Up your PC for Deep Learning") on towardsdatascience.com.

After installation, run the nvidia-smi command to check the CUDA installation.

Basic Example

#Checking GPU available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device
#output: device(type='cuda')
#Create a tensor
x = torch.tensor([[10,23,34,57], [23,18,37,27]])
x.to(device)
#output: tensor([[10, 23, 34, 57],
        [23, 18, 37, 27]], device='cuda:0')
#Tensor is now on device cuda:0
x = x.to(device)
#Create another tensor and convert it into cuda:0
y = torch.tensor([[10,34,28,29], [56,29,10,28]])
y = y.to(device)
#sum
x.add(y)
#output:tensor([[20, 57, 62, 86],
        [79, 47, 47, 55]], device='cuda:0')
# Create uninitialized tensor
x = torch.FloatTensor(4,2)
print(x)
# Initialize to zeros
x.zero_()
print(x)
#output: tensor([[0.0000e+00, 0.0000e+00],
                  ...])
#output: tensor([[0., 0.],
           ... )
# special tensors
print(torch.eye(4))
print(torch.ones(2,3))
print(torch.zeros(2,3))
print(torch.arange(0,3))
#output: tensor([[1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0, 1, 2])

Basic Maths

#Torch arange (the step of 5 exceeds the range, so only the start value -1.0 is produced)
x = torch.arange(-1.0,1.0,5)
print(x)
print(torch.sum(x))
print(torch.sum(torch.exp(x)))
print(torch.mean(x))
#output: tensor([-1.])
         tensor(-1.)
         tensor(0.3679)
         tensor(-1.)
#Random
x = torch.rand(3,2)
print(x)
print(x[1,:])
#output: tensor([[0.5109, 0.6509],
        [0.6475, 0.4615],
        [0.9806, 0.7422]])
        tensor([0.6475, 0.4615])
# create a tensor
x = torch.rand(4,2)
# copy to GPU
y = x.cuda()
# copy back to CPU
z = y.cpu()
# get CPU tensor as numpy array
# cannot get GPU tensor as numpy array directly
try:
    y.numpy()
except Exception as e:
    print(e)
#output: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
#create pytorch tensor
x = torch.rand(3,5)  # CPU tensor
y = torch.rand(5,4).cuda()  # GPU tensor
try:
    torch.mm(x,y)  # Operation between CPU and GPU fails
except Exception as e:
    print(e)
#output: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_mm
# Put tensor on CUDA
x = torch.rand(4,2)
if torch.cuda.is_available():
    x = x.cuda()
    print(x, x.dtype)
#output: tensor([[0.2241, 0.2719],
        [0.4515, 0.0441],
        [0.5389, 0.4390],
        [0.7164, 0.9530]], device='cuda:0') torch.float32
# Do some calculations
y = x ** 2 
print(y)
# Copy to CPU if on GPU
if y.is_cuda:
    y = y.cpu()
    print(y, y.dtype)
#output: tensor([[0.0502, 0.0739],
        [0.2039, 0.0019],
        [0.2905, 0.1927],
        [0.5132, 0.9081]], device='cuda:0')
tensor([[0.0502, 0.0739],
        [0.2039, 0.0019],
        [0.2905, 0.1927],
        [0.5132, 0.9081]]) torch.float32
#Example
x1 = torch.rand(4,2)
x2 = x1.new(1,2)  # create cpu tensor
print(x2)
x1 = torch.rand(4,2).cuda()
x2 = x1.new(1,2)  # create cuda tensor
print(x2)
#output: tensor([[1.4013e-45, 0.0000e+00]])
         tensor([[0.0502, 0.0739]], device='cuda:0')
#Time taken for processing CPU vs GPU
from timeit import timeit
# Create random data
val1 = torch.rand(5000,64) #creating cpu tensor
val2 = torch.rand(64,32)  #creating cpu tensor
number = 10000  # number of iterations
def square():
    '''dot product (mm=matrix multiplication)'''
    val=torch.mm(val1, val2) 

# Time CPU
print('CPU: {}ms'.format(timeit(square, number=number)*1000))
# Time GPU
val1, val2 = val1.cuda(), val2.cuda()
print('GPU: {}ms'.format(timeit(square, number=number)*1000))
#output: CPU: 1314.0355000000027ms
         GPU: 1909.7302999999997ms
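One caveat about the timing above: CUDA kernels are launched asynchronously, so a timer around GPU calls can stop before the work has actually finished. The sketch below is one possible way to time the same matrix product with explicit synchronization; time_mm is a hypothetical helper written for this example, not part of PyTorch, and the GPU branch only runs when a CUDA device is available.

```python
import time
import torch

def time_mm(a, b, repeats=100):
    """Average seconds per torch.mm call (hypothetical helper for this example)."""
    on_gpu = a.is_cuda
    torch.mm(a, b)                      # warm-up call, excluded from the timing
    if on_gpu:
        torch.cuda.synchronize()        # wait for pending kernels before starting
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mm(a, b)
    if on_gpu:
        torch.cuda.synchronize()        # make sure all queued work has finished
    return (time.perf_counter() - start) / repeats

a = torch.rand(5000, 64)
b = torch.rand(64, 32)

cpu_time = time_mm(a, b)
print(f"CPU: {cpu_time * 1e3:.4f} ms per call")

if torch.cuda.is_available():
    gpu_time = time_mm(a.cuda(), b.cuda())
    print(f"GPU: {gpu_time * 1e3:.4f} ms per call")
```

The numbers depend heavily on the hardware and on the matrix sizes; for small matrices such as these, launch overhead can still make the GPU slower.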

Another Example

# Differentiation
import torch
import matplotlib.pyplot as plt 
x = torch.linspace(-10.0,10.0,10, requires_grad=True)
Y = x**2
y = torch.sum(x**2)     
y.backward()

plt.plot(x.detach().numpy(), Y.detach().numpy(), label="Y")
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), label="derivatives")
plt.legend()
#output:<matplotlib.legend.Legend at 0x177906144c8>

# Create a variable
x=torch.linspace(-10.0,10.0,10, requires_grad=True)
# Differentiate
torch.sum(x**2).backward()
print(x.grad)
# Differentiate again (accumulates gradient)
torch.sum(x**2).backward()
print(x.grad)
# Zero gradient before differentiating
x.grad.zero_()
torch.sum(x**2).backward()
print(x.grad)
#output: tensor([-20.0000, -15.5556, -11.1111,  -6.6667,  -2.2222,   2.2222,   6.6667,
           ... ])
#Example 
x=torch.arange(0.0,4.0, requires_grad=True)
try:
    x.numpy() # raises an exception
except Exception as e:
    print(e)
#output: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.
#Example
x=torch.arange(0,4.0, requires_grad=True)
y=x**2
z=y**2
z.detach().numpy()
#output: array([ 0.,  1., 16., 81.], dtype=float32)
#Example
x = torch.ones(4,5)
y = torch.arange(5)
print(x+y)
y = torch.arange(4).view(-1,1)
print(x+y)
y = torch.arange(4)
try:
    print(x+y) # exception
except Exception as e:
    print(e)
#output: tensor([[1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.]])
         tensor([[1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2.],
        [3., 3., 3., 3., 3.],
        [4., 4., 4., 4., 4.]])
The size of tensor a (5) must match the size of tensor b (4) at non-singleton dimension 1

Neural Network Modules

Neurons

A neuron is a mathematical function that takes one or more input values, and outputs a single numerical value:

The neuron is defined as follows:

Formula
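A single neuron can be sketched in a few lines. The formula itself is not reproduced in the text, so the form used here (a weighted sum of the inputs plus a bias, passed through a sigmoid activation) is an assumption based on the usual definition, and the weights and inputs are made-up values.

```python
import torch

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid (assumed form)."""
    return torch.sigmoid(torch.dot(w, x) + b)

x = torch.tensor([1.0, -1.0, 0.5])    # three input values
w = torch.tensor([0.2, 0.4, -0.6])    # one weight per input (illustrative values)
b = torch.tensor(0.1)                 # bias (illustrative value)

out = neuron(x, w, b)
print(out)                            # a single numerical value

# The same computation expressed with nn.Linear holding w and b.
layer = torch.nn.Linear(3, 1)
with torch.no_grad():
    layer.weight.copy_(w.view(1, 3))
    layer.bias.copy_(b.view(1))
out_layer = torch.sigmoid(layer(x))
print(out_layer)
```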

[Figure: an abstraction of a neural network representing a cube; different layers encode features at different levels of abstraction.]

Example

#Simple Linear Module
import torch
import torch.nn as nn

x = torch.tensor([[1.0, -1.0],
                  [0.0,  1.0],
                  [0.0,  0.0]])

in_features = x.shape[1]  # = 2
out_features = 2

net = nn.Linear(in_features, out_features)
y = net(x)
print(y)
#output: tensor([[ 1.3718, -0.0525],
        [-0.0180, -0.5750],
        [ 0.6493, -0.2669]], grad_fn=<AddmmBackward>)
'''create a simple sequential network (`nn.Module` object) from layers (other `nn.Module` objects).
Here a MLP with 2 layers and sigmoid activation.'''
net = torch.nn.Sequential(torch.nn.Linear(32,128),
                           torch.nn.Sigmoid(),
                           torch.nn.Linear(128,10))
# customizable network module creation
class MyNetwork(torch.nn.Module):
    # you can use the layer sizes as initialization arguments if you want to
    def __init__(self,input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(input_size,hidden_size)
        self.layer2 = torch.nn.Sigmoid()
        self.layer3 = torch.nn.Linear(hidden_size,output_size)

    def forward(self, input_val):
        h = input_val
        h = self.layer1(h)
        h = self.layer2(h)
        h = self.layer3(h)
        return h

net = MyNetwork(32,128,10)
#print
for param in net.parameters():
    print(param)
#output: Parameter containing:
tensor([[ 0.1360, -0.0284, -0.0086,  ..., -0.0159,  0.1162, -0.0504],
        [-0.1099, -0.1526,  0.0903,  ...,  0.0514, -0.0052,  0.0328],
        ...
# customizable network module creation with parameters
class MyNetworkWithParams(nn.Module):
    def __init__(self,input_size, hidden_size, output_size):
        super(MyNetworkWithParams,self).__init__()
        self.layer1_weights = nn.Parameter(torch.randn(input_size,hidden_size))
        self.layer1_bias = nn.Parameter(torch.randn(hidden_size))
        self.layer2_weights = nn.Parameter(torch.randn(hidden_size,output_size))
        self.layer2_bias = nn.Parameter(torch.randn(output_size))
        
    def forward(self,x):
        h1 = torch.matmul(x,self.layer1_weights) + self.layer1_bias
        h1_act = torch.relu(h1) # ReLU: elementwise max(h1, 0), works on any device
        output = torch.matmul(h1_act,self.layer2_weights) + self.layer2_bias
        return output

net = MyNetworkWithParams(32,128,10)
#print
for param in net.parameters():
    print(param)
#output: Parameter containing:
tensor([[ 2.4859, -1.6750,  0.0665,  ...,  2.0735,  0.0214, -0.8285],
        ... ,
        ... , requires_grad=True)
#Basic Training with above Customized Module
import numpy as np
#Assign parameters value
net = MyNetwork(32,128,10)
#Processing (stack the rows into one array before converting to a tensor)
x = torch.tensor(np.stack([np.arange(32), np.zeros(32), np.ones(32)])).float()
y = torch.tensor([0,3,9])
criterion = nn.CrossEntropyLoss()

output = net(x)
loss = criterion(output,y)
print(loss)
#output: tensor(2.2683, grad_fn=<NllLossBackward>)
# equivalent: log-softmax followed by the negative log-likelihood loss
criterion2 = nn.NLLLoss()
sf = nn.LogSoftmax(dim=1)
output = net(x)
loss = criterion2(sf(output),y)
print(loss)
#output: tensor(2.2683, grad_fn=<NllLossBackward>)
#accumulates gradient
loss.backward()
# Check that the parameters now have gradients
for param in net.parameters():
    print(param.grad)
#output: tensor([[-6.2433e-03, -2.3473e-03,  1.5488e-03,  ...,  1.0674e-01,
          1.1064e-01,  1.1454e-01],
        ... ])
#forward prop and backward prop again
output = net(x)
loss = criterion(output,y)
loss.backward()
for param in net.parameters():
    print(param.grad)
#output: tensor([[-1.2487e-02, -4.6945e-03,  3.0976e-03,  ...,  2.1349e-01,
       ... ]])
#Removing this behavior by reinitializing the gradients
net.zero_grad()
output = net(x)
loss = criterion(output,y)
loss.backward()
for param in net.parameters():
    print(param.grad)
#output: tensor([[-6.2433e-03, -2.3473e-03,  1.5488e-03,  ...,  1.0674e-01,
          1.1064e-01,  1.1454e-01],
       ... ])
#Optimizing
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

print("Parameters before gradient descent :")
for param in net.parameters():
    print(param)
#output: Parameters before gradient descent :
Parameter containing:
tensor([[-0.1649,  0.1513,  0.0330,  ...,  0.1573,  0.1200, -0.0095],... , requires_grad=True)
#Optimize
optimizer.step()
print("Parameters after gradient descent :")
for param in net.parameters():
    print(param)
#output: Parameters after gradient descent :
Parameter containing:
tensor([[-0.1648,  0.1513,  0.0330,  ...,  0.1562,  0.1189, -0.0107],
  ... , requires_grad=True)
# In training loop
n_iter = 1000
for i in range(n_iter):
    optimizer.zero_grad() # equivalent to net.zero_grad()
    output = net(x)
    loss = criterion(output,y)
    loss.backward()
    optimizer.step()
    print(loss)
#output: tensor(2.1482, grad_fn=<NllLossBackward>)
                      ...
#print
output = net(x)
print(output)
print(y)
#output: tensor([[ 7.9251, -1.6502, -1.3369, -0.1119, -1.5680, -1.1583, -1.1655, -1.2301,
         ..., grad_fn=<AddmmBackward>)
tensor([0, 3, 9])
#Saving and Loading
# get dictionary of keys to weights using `state_dict`
net = torch.nn.Sequential(
    torch.nn.Linear(28*28,256),
    torch.nn.Sigmoid(),
    torch.nn.Linear(256,10))
print(net.state_dict().keys())
#output: odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])
# save a dictionary
torch.save(net.state_dict(),'test.t7')
# load a dictionary
net.load_state_dict(torch.load('test.t7'))
# output: <All keys matched successfully>

Example

#Linear Model
net = nn.Sequential(nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,120))
x = torch.ones(256,2048).cuda()
y = torch.zeros(256).long().cuda()
net.cuda()
crit=nn.CrossEntropyLoss()
out = net(x)
loss = crit(out,y)
loss.backward()
print(loss)
#output: tensor(4.7919, device='cuda:0', grad_fn=<NllLossBackward>)
#Define Class
class MyNet(nn.Module):
    def __init__(self,n_hidden_layers):
        super(MyNet,self).__init__()
        self.n_hidden_layers=n_hidden_layers
        self.final_layer = nn.Linear(128,10)
        self.act = nn.ReLU()
        self.hidden = []
        for i in range(n_hidden_layers):
            self.hidden.append(nn.Linear(128,128))
        self.hidden = nn.ModuleList(self.hidden)
            
    def forward(self,x):
        h = x
        for i in range(self.n_hidden_layers):
            h = self.hidden[i](h)
            h = self.act(h)
        out = self.final_layer(h)
        return out
net = MyNet(32)
print("Parameters are :")
for param in net.parameters():
    print(param)
#output: Parameters are :
Parameter containing:
tensor([[ 0.0717,  0.0513,  0.0815,  ..., -0.0377,  0.0310,  0.0840],
       ...,
       requires_grad=True)

Deep Learning Algorithm Implementation using Pytorch on GPU

#import
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#parameters
NUM_INPUTS=100
HIDDEN_SIZE=1024
NUM_OUTPUTS=20

Linear Regression

The linear regression model describes the output variable y (a scalar) as an affine combination of the input variables x1, x2, …, xp (each a scalar) plus a noise term ε:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

We refer to the coefficients β0,β1,..,βp as the parameters in the model, and we sometimes refer to β0 specifically as the intercept term. The noise term ε accounts for non-systematic, i.e., random, errors between the data and the model. The noise is assumed to have mean zero and to be independent of x.

[Figure: linear regression with p = 1. The black dots represent n = 3 data points, from which a linear regression model (blue line) is learned.]

#Linear Regression
lir = nn.Sequential(
    nn.Linear(NUM_INPUTS, 1))
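To show how the parameters β0, β1, …, βp of such a model are learned, here is a minimal sketch that fits nn.Linear to synthetic data. The data, the coefficients (1, 2, -3), the learning rate and the number of steps are all made up for the example; the weight of the layer plays the role of β1..βp and its bias the role of β0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: y = 1 + 2*x1 - 3*x2 + small noise (made-up coefficients).
X = torch.randn(200, 2)
true_beta = torch.tensor([[2.0], [-3.0]])
y = 1.0 + X @ true_beta + 0.01 * torch.randn(200, 1)

model = nn.Sequential(nn.Linear(2, 1))    # weight = beta_1..beta_p, bias = beta_0
criterion = nn.MSELoss()                  # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

print(model[0].weight.data, model[0].bias.data)
```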

Logistic Regression

The key idea underlying logistic regression is to 'squeeze' the output z = β0 + β1x1 + … + βpxp from linear regression into the interval [0, 1] by using the logistic function,

$$h(z) = \frac{e^{z}}{1 + e^{z}}$$

Since the logistic function is limited to take values between 0 and 1, we obtain altogether a function from x to [0, 1], which we can use as a model for p(y = 1 | x),

$$p(y = 1 \mid x) = \frac{e^{z}}{1 + e^{z}}, \qquad p(y = 0 \mid x) = 1 - p(y = 1 \mid x) = \frac{1}{1 + e^{z}}$$

We now have a model for p(y = 1 | x) and p(y = 0 | x), which contains unknown parameters β that can be learned from training data.

#Logistic Regression
lor = nn.Sequential(
    nn.Linear(NUM_INPUTS, 1),
    nn.Sigmoid())
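The following sketch checks numerically that the Linear plus Sigmoid model outputs values in (0, 1) and that the sigmoid equals the logistic function e^z / (1 + e^z) of the linear part z. The input size of 4, the random inputs and the random targets are arbitrary choices for illustration; binary cross-entropy is used here as a common training loss for this model, which the text above does not prescribe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())   # 4 inputs: arbitrary choice
x = torch.randn(8, 4)                                  # batch of 8 random examples

p1 = model(x)          # modelled p(y = 1 | x), one value per example
p0 = 1.0 - p1          # p(y = 0 | x)

# The sigmoid output is the logistic function applied to the linear part z.
z = model[0](x)
manual = torch.exp(z) / (1.0 + torch.exp(z))
print(torch.allclose(manual, p1, atol=1e-6))

# Binary cross-entropy between the predicted probabilities and 0/1 targets.
targets = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCELoss()(p1, targets)
print(loss)
```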

Softmax classifier

In the softmax classifier, the function mapping

$$f(x_i; W) = W x_i$$

stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

where we are using the notation f_j to mean the j-th element of the vector of class scores f. The full loss for the dataset is the mean of L_i over all training examples together with a regularization term R(W). The function

$$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

is called the softmax function. It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sums to one.

#Softmax classifier
smx = nn.Sequential(
    nn.Linear(NUM_INPUTS, NUM_OUTPUTS),
    nn.LogSoftmax(dim=1))
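Since the model above ends in LogSoftmax, it pairs with nn.NLLLoss; feeding raw scores to nn.CrossEntropyLoss gives the same value. The sketch below checks this on random scores and also writes the cross-entropy out by hand. The scores and labels are random or made-up values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(5, 3)                 # 5 examples, 3 class scores each
labels = torch.tensor([0, 2, 1, 1, 0])     # made-up correct classes

log_probs = F.log_softmax(scores, dim=1)
probs = log_probs.exp()
print(probs.sum(dim=1))                    # each row sums to one

# Cross-entropy written out: mean over examples of -log p(correct class).
manual = -log_probs[torch.arange(5), labels].mean()

nll = nn.NLLLoss()(log_probs, labels)         # expects log-probabilities
ce = nn.CrossEntropyLoss()(scores, labels)    # expects raw scores
print(manual, nll, ce)
```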

Multilayer Perceptron

An MLP is a network of simple neurons called perceptrons. The perceptron computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then possibly putting the output through some nonlinear activation function. Mathematically this can be written as

$$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \varphi(w^{T} x + b)$$

where w denotes the vector of weights, x is the vector of inputs, b is the bias and φ is the activation function. Nowadays, and especially in multilayer networks, the activation function is often chosen to be the logistic sigmoid

$$\frac{1}{1 + e^{-x}}$$

or the hyperbolic tangent tanh(x). They are related by

$$\frac{\tanh(x) + 1}{2} = \frac{1}{1 + e^{-2x}}$$

These functions are used because they are mathematically convenient and are close to linear near origin while saturating rather quickly when getting away from the origin. This allows MLP networks to model well both strongly and mildly nonlinear mappings.
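The relation between the two activation functions, (tanh(x) + 1) / 2 = 1 / (1 + e^(-2x)), is easy to check numerically. A short sketch on an arbitrary grid of points:

```python
import torch

x = torch.linspace(-5.0, 5.0, steps=11)

lhs = (torch.tanh(x) + 1.0) / 2.0     # rescaled hyperbolic tangent
rhs = torch.sigmoid(2.0 * x)          # logistic sigmoid evaluated at 2x
print(torch.allclose(lhs, rhs, atol=1e-6))

# Near the origin tanh is almost linear; far from it the output saturates.
print(torch.tanh(torch.tensor([0.01, 0.1, 3.0, 10.0])))
```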

Now, an example of an MLP with 40 input variables is shown below (the code itself uses the NUM_INPUTS value of 100 defined earlier):

#MultiLayer Perceptron
mlp = nn.Sequential(
    nn.Linear(NUM_INPUTS, HIDDEN_SIZE),
    nn.Tanh(),
    nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE),
    nn.Tanh(),
    nn.Linear(HIDDEN_SIZE, NUM_OUTPUTS),
    nn.LogSoftmax(dim=1)
)

Embedding with a fully connected layer

A fully connected neural network with two hidden layers. The input layer has k input neurons, the first hidden layer has n hidden neurons, and the second hidden layer has m hidden neurons. The output, in this example, is the two classes y1 and y2. On top is the always-on bias neuron. A unit from one layer is connected to all units from the previous and following layers (hence fully connected). Each connection has its own weight, w, that is not depicted for reasons of simplicity:

Let x ∈ ℝ^m represent the input to a fully connected layer. Let y_i ∈ ℝ be the i-th output from the fully connected layer. Then y_i is computed as follows:

$$y_i = \sigma(w_1 x_1 + \cdots + w_m x_m)$$

Here, σ is a nonlinear function (for now, think of σ as the sigmoid function introduced above), and the w_i are learnable parameters in the network. The full output y is then

$$y = \begin{pmatrix} \sigma(w_{1,1} x_1 + \cdots + w_{1,m} x_m) \\ \vdots \\ \sigma(w_{n,1} x_1 + \cdots + w_{n,m} x_m) \end{pmatrix}$$

A network with multiple fully connected layers is often called a "deep" network.

#Embedding with fully connected layer
VOCAB_SIZE = 10000
HIDDEN_SIZE=100
# mapping a Vocabulary of size 10,000 to HIDDEN_SIZE projections
emb_1 = nn.Linear(VOCAB_SIZE, HIDDEN_SIZE)
# forward example [10, 10000] tensor
code = [1] + [0] * 9999
# copy 10 times the same code [1 0 0 0 ... 0]
x = torch.FloatTensor([code] * 10)
print('Input x tensor size: ', x.size())
y = emb_1(x)
print('Output y embedding size: ', y.size())
#output:
Input x tensor size:  torch.Size([10, 10000])
Output y embedding size:  torch.Size([10, 100])
#Embedding with Embedding layer
VOCAB_SIZE = 10000
HIDDEN_SIZE=100
# mapping a Vocabulary of size 10,000 to HIDDEN_SIZE projections
emb_2 = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)
# Just make a long tensor with zero-index
x = torch.zeros(10, 1).long()
print('Input x tensor size: ', x.size())
y = emb_2(x)
print('Output y embedding size: ', y.size())
#output:
Input x tensor size:  torch.Size([10, 1])
Output y embedding size:  torch.Size([10, 1, 100])
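The two embedding variants above are closely related: looking up index i in nn.Embedding returns row i of its weight matrix, which is what multiplying a one-hot vector by that matrix gives. The sketch below checks this on a small vocabulary; the sizes and indices are arbitrary and much smaller than in the example above to keep the check cheap.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, hidden_size = 50, 8          # small illustrative sizes
emb = nn.Embedding(vocab_size, hidden_size)

idx = torch.tensor([0, 7, 7, 49])        # token indices
looked_up = emb(idx)                     # rows of emb.weight selected by index

# One-hot encode the indices and project them with the same weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
projected = one_hot @ emb.weight

print(looked_up.size(), projected.size())
print(torch.allclose(looked_up, projected))
```

The lookup avoids building the large, mostly zero one-hot input, which is why nn.Embedding is usually preferred for big vocabularies.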

Recurrent Neural Network

A recurrent neural network (RNN) is able to process a sequence of arbitrary length by recursively applying a transition function to its internal hidden state vector ht of the input sequence. The activation of the hidden state ht at time-step t is computed as a function f of the current input symbol xt and the previous hidden state ht−1,

$$h_t = f(x_t, h_{t-1})$$

[Figure: architecture of an RNN]

The equations for one step of an RNN calculation are:

$$a_t = W_h h_{t-1} + W_x x_t$$

$$h_t = \tanh(a_t)$$

$$\hat{y}_t = \mathrm{softmax}(W_y h_t)$$

These three equations are the most important ones. The hidden nodes a_t combine the previous state's output weighted by the weight matrix W_h with the input x_t weighted by the weight matrix W_x. The tanh function is the activation function. The output of the hidden state, h_t, is the activation function applied to the hidden nodes. To make a prediction, we take the output from the current hidden state and weight it by the weight matrix W_y with a softmax activation.
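One recurrent step as described above can be written out by hand and compared with nn.RNNCell. This is a sketch under assumptions: the sizes are small made-up values, the output layer W_y is a hypothetical nn.Linear added for the prediction step, and PyTorch's cell also carries two bias vectors that the description does not mention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 4, 3          # small illustrative sizes
cell = nn.RNNCell(input_size, hidden_size, nonlinearity="tanh")

x_t = torch.randn(1, input_size)        # current input, batch of 1
h_prev = torch.zeros(1, hidden_size)    # previous hidden state

# Hidden nodes: input weighted by W_x plus previous state weighted by W_h
# (weight_ih and weight_hh in PyTorch, each with its own bias).
a_t = (x_t @ cell.weight_ih.t() + cell.bias_ih
       + h_prev @ cell.weight_hh.t() + cell.bias_hh)
h_t = torch.tanh(a_t)                   # output of the hidden state

# Prediction: hidden state weighted by W_y followed by a softmax.
W_y = nn.Linear(hidden_size, 2)         # 2 output classes: arbitrary choice
y_t = torch.softmax(W_y(h_t), dim=1)

print(torch.allclose(h_t, cell(x_t, h_prev), atol=1e-6))
print(y_t)
```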

# Recurrent Neural Network
NUM_INPUTS = 100
HIDDEN_SIZE = 512
NUM_LAYERS = 1
# define a recurrent layer
rnn = nn.RNN(NUM_INPUTS, HIDDEN_SIZE, num_layers=NUM_LAYERS)
SEQ_LEN = 100
x = torch.randn(SEQ_LEN, 1, NUM_INPUTS)
print('Input tensor size [seq_len, bsize, input_size]: ', x.size())
ht, state = rnn(x, None)
print('Output tensor h[t] size [seq_len, bsize, hidden_size]: ', ht.size())
#output:
Input tensor size [seq_len, bsize, input_size]:  torch.Size([100, 1, 100])
Output tensor h[t] size [seq_len, bsize, hidden_size]:  torch.Size([100, 1, 512])
NUM_INPUTS = 100
HIDDEN_SIZE = 512
NUM_LAYERS = 1
# define a recurrent layer, swapping batch and time axis
rnn = nn.RNN(NUM_INPUTS, HIDDEN_SIZE, num_layers=NUM_LAYERS,
            batch_first=True)
SEQ_LEN = 100
x = torch.randn(1, SEQ_LEN, NUM_INPUTS)
print('Input tensor size [bsize, seq_len, input_size]: ', x.size())
ht, state = rnn(x, None)
print('Output tensor h[t] size [bsize, seq_len, hidden_size]: ', ht.size())
#output:
Input tensor size [bsize, seq_len, input_size]:  torch.Size([1, 100, 100])
Output tensor h[t] size [bsize, seq_len, hidden_size]:  torch.Size([1, 100, 512])
# let's check ht and state sizes
print('ht size: ', ht.size())
print('state size: ', state.size())
#output:
ht size:  torch.Size([1, 100, 512])
state size:  torch.Size([1, 1, 512])
NUM_INPUTS = 100
NUM_OUTPUTS = 10
HIDDEN_SIZE = 512
SEQ_LEN = 100
NUM_LAYERS = 1
# define a recurrent layer, swapping batch and time axis and connect
# an FC layer as an output layer to build a full network
rnn = nn.RNN(NUM_INPUTS, HIDDEN_SIZE, num_layers=NUM_LAYERS,
            batch_first=True)
fc = nn.Sequential(
    nn.Linear(HIDDEN_SIZE, NUM_OUTPUTS),
    nn.LogSoftmax(dim=2)
)

x = torch.randn(1, SEQ_LEN, NUM_INPUTS)
print('Input tensor size x: ', x.size())
ht, state = rnn(x, None)
print('Hidden tensor size ht: ', ht.size())
y = fc(ht)
print('Output tensor y size: ', y.size())
# output:
# Input tensor size x:  torch.Size([1, 100, 100])
# Hidden tensor size ht:  torch.Size([1, 100, 512])
# Output tensor y size:  torch.Size([1, 100, 10])
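To make the relation between the two values returned by `nn.RNN` concrete, here is a small check. It assumes a single-layer, unidirectional RNN as in the examples above; the variable names `last_step` and `final_state` are only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(100, 512, num_layers=1, batch_first=True)
x = torch.randn(1, 100, 100)        # [bsize, seq_len, num_inputs]
ht, state = rnn(x)                  # initial state defaults to zeros

last_step = ht[:, -1, :]            # hidden output at the final time step
final_state = state[-1]             # state of the last layer, [bsize, hidden_size]
print(torch.allclose(last_step, final_state))
```

In this setting `ht` holds the hidden output for every time step, while `state` keeps only the final one.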

Long Short-Term Memory (LSTM)

LSTM is a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is practically their default behavior. An LSTM network has three gates that update and control the cell state: the forget gate, the input gate, and the output gate. The gates use sigmoid activations, while the candidate layer uses the hyperbolic tangent. LSTM was designed specifically to mitigate the vanishing gradient problem. It does so with the Constant Error Carousel (CEC), which in the diagram below corresponds to the loop around the cell.

The LSTM has a chain-like structure as below,

The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. The diagram of an LSTM cell unrolled over T time steps is as below,

The terms are:

i. Forget gate “f” (a neural network with sigmoid)

ii. Candidate layer “C`” (a neural network with tanh)

iii. Input gate “I” (a neural network with sigmoid)

iv. Output gate “O” (a neural network with sigmoid)

v. Hidden state “H” (a vector)

vi. Memory state “C” (a vector)
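The terms listed above can be tied together with one manual LSTM step. The sketch below reuses the weights of an `nn.LSTMCell` and recomputes its output by hand; the sizes are illustrative assumptions, and the gate ordering (input, forget, candidate, output) follows the weight layout described in the PyTorch documentation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 6, 4      # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

# All four pre-activations in one matrix product; PyTorch stacks the
# weights in the order input, forget, candidate, output.
gates = (x @ cell.weight_ih.T + cell.bias_ih
         + h_prev @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)

i = torch.sigmoid(i)                # input gate "I"
f = torch.sigmoid(f)                # forget gate "f"
g = torch.tanh(g)                   # candidate layer "C`"
o = torch.sigmoid(o)                # output gate "O"

c = f * c_prev + i * g              # memory state "C"
h = o * torch.tanh(c)               # hidden state "H"

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h, h_ref, atol=1e-6), torch.allclose(c, c_ref, atol=1e-6))
```

The memory state is updated additively (`f * c_prev + i * g`), which is the loop around the cell that helps gradients survive over many time steps.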

#LSTM Recurrent Neural Network
lstm = nn.LSTM(NUM_INPUTS, HIDDEN_SIZE, num_layers=NUM_LAYERS,
              batch_first=True)
x = torch.randn(1, SEQ_LEN, NUM_INPUTS)
print('Input tensor size x: ', x.size())
ht, states = lstm(x, None)
hT, cT = states[0], states[1]
print('Output tensor ht size: ', ht.size())
print('Last state h[T]: ', hT.size())
print('Cell state c[T]: ', cT.size())
# output:
# Input tensor size x:  torch.Size([1, 100, 100])
# Output tensor ht size:  torch.Size([1, 100, 512])
# Last state h[T]:  torch.Size([1, 1, 512])
# Cell state c[T]:  torch.Size([1, 1, 512])

Convolutional Neural Network

Convolutional Neural Networks are very similar to the ordinary neural networks described earlier: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels on one end to class scores at the other. It still has a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, and all the tips and tricks for training regular neural networks still apply.

In simple terms, every image is a grid of pixels. We analyze the influence of nearby pixels in an image by using a filter (also called weights, a kernel, or a feature detector).

Filters are tensors that keep track of spatial information and, inside a convolutional layer, learn to extract features such as edges and smooth curves of objects. Detecting edges is a major part of this, and it is the filters that detect them. Filtering also helps to suppress unwanted information and emphasize the relevant parts of an image. High-pass filters, for example, respond where the intensity changes very quickly, such as from a black pixel to a white one or vice versa.

The following image shows the convolutional operation,

If the images are too large, we need to reduce their size. Pooling layers downsample the feature maps, which reduces the number of parameters in the layers that follow.

The next step is a non-linearity. Usually, the ReLU activation function is used. ReLU stands for Rectified Linear Unit, and its output is ƒ(x) = max(0, x).

ReLU is currently one of the most widely used activation functions, since it appears in almost all convolutional neural networks and deep learning models.

The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
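The multiply, sum, and offset procedure described above can be written out by hand and compared with the library routine. This is a minimal 1-D sketch with made-up sizes; the names `manual` and `lib` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8)                  # 1-D input signal
w = torch.randn(3)                  # filter (kernel) of width 3
b = torch.tensor(0.5)               # bias

# Slide the filter over the input: multiply elementwise, sum, add the bias
manual = torch.stack([(x[t:t + 3] * w).sum() + b for t in range(8 - 3 + 1)])

# Same computation with the library routine ([batch, channels, time] layout)
lib = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1), bias=b.view(1)).view(-1)
print(manual.shape, torch.allclose(manual, lib, atol=1e-6))

# ReLU keeps positive activations and zeroes the rest: max(0, x)
activated = torch.clamp(manual, min=0)
print(torch.equal(activated, F.relu(manual)))
```

With an input of length 8 and a filter of width 3 there are 6 valid positions, which is why the unpadded output is shorter than the input.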

#Convolutional Neural Network
NUM_CHANNELS_IN = 1
HIDDEN_SIZE = 1024
KERNEL_WIDTH = 3
# Build a one-dimensional convolutional neural layer
conv1d = nn.Conv1d(NUM_CHANNELS_IN, HIDDEN_SIZE, KERNEL_WIDTH)

SEQ_LEN = 8
x = torch.randn(1, NUM_CHANNELS_IN, SEQ_LEN)
print('Input tensor size x: ', x.size())
y = conv1d(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor size x:  torch.Size([1, 1, 8])
# Output tensor y size:  torch.Size([1, 1024, 6])

NUM_CHANNELS_IN = 1
HIDDEN_SIZE = 1024
KERNEL_WIDTH = 3
PADDING = KERNEL_WIDTH // 2 # = 1
# Build a one-dimensional convolutional neural layer
conv1d = nn.Conv1d(NUM_CHANNELS_IN, HIDDEN_SIZE, KERNEL_WIDTH, 
                   padding=PADDING)

SEQ_LEN = 8
x = torch.randn(1, NUM_CHANNELS_IN, SEQ_LEN)
print('Input tensor size x: ', x.size())
y = conv1d(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor size x:  torch.Size([1, 1, 8])
# Output tensor y size:  torch.Size([1, 1024, 8])

import torch.nn.functional as F  # needed for F.pad below

NUM_CHANNELS_IN = 1
HIDDEN_SIZE = 1024
KERNEL_WIDTH = 3
# Build a one-dimensional convolutional neural layer
conv1d = nn.Conv1d(NUM_CHANNELS_IN, HIDDEN_SIZE, KERNEL_WIDTH)
                   
SEQ_LEN = 8
PADDING = KERNEL_WIDTH - 1 # = 2
x = torch.randn(1, NUM_CHANNELS_IN, SEQ_LEN)
print('Input tensor x size: ', x.size())
xpad = F.pad(x, (PADDING, 0))
print('Input tensor after padding xpad size: ', xpad.size())
y = conv1d(xpad)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 1, 8])
# Input tensor after padding xpad size:  torch.Size([1, 1, 10])
# Output tensor y size:  torch.Size([1, 1024, 8])
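The three output lengths above (6, 8, and 8) follow from the output-length formula given in the `torch.nn.Conv1d` documentation. The helper `conv1d_out_len` below is a hypothetical name used only for this sketch.

```python
import torch
import torch.nn as nn

def conv1d_out_len(l_in, kernel, stride=1, padding=0, dilation=1):
    # Output length formula from the torch.nn.Conv1d documentation
    return (l_in + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

x = torch.randn(1, 1, 8)
for pad in (0, 1):
    y = nn.Conv1d(1, 4, 3, padding=pad)(x)
    print(pad, y.size(2), conv1d_out_len(8, 3, padding=pad))

# Left-only padding of kernel - 1 samples lengthens the input from 8 to 10
causal_len = conv1d_out_len(8 + 2, 3)
print(causal_len)
```

Symmetric padding of `kernel // 2` and left-only padding of `kernel - 1` both preserve the length, but only the latter keeps each output from depending on future samples.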

Convolutional Neural Network as an MLP

MLPs and CNNs are two similar models of neural networks; however, they differ greatly in terms of performance. Unlike MLPs, CNN architectures are deep and require computationally expensive operations, which brings two training downsides: long training time and high demand for computing resources. One question that follows is whether giving an MLP more training epochs and more data samples lets it reach the performance of a CNN.

Consider conv1 as the input to conv2. conv2 (sparse connectivity) operates in the same way as an MLP network (MLPN, full connectivity) if the feature map size (width and height) of conv1 is equal to the size of the filters of conv2. The mathematical computation that every neuron performs in both network types is expressed as follows:

The output of every neuron in the MLPN is found by using the algebraic dot product as the similarity function, and the convolution output of every neuron in the feature map is expressed as below,

The main difference between a CNN and an MLPN is that in the latter every input element is connected to every neuron in the hidden layer. The general MLPN architecture is as below,

The output of the activation map is found by using a similarity function called convolution, as we convolve the kernel over the input matrix. Each convolution operation produces the output of one neuron in the activation map.

The construction of the feature map (activation map) is as below,

In the above figure, the first filter looks for a certain feature within the input volume and generates its own feature map, which shows where that feature exists within the original image. The second filter does the same, generating its own feature map. Convolution is a lossy operation, i.e. it does not maintain the spatial resolution. The bigger the stride and filter size, the more spatial resolution we lose. The spatial resolution of the original input can be preserved by introducing zero-padding.

#Convolutional Neural Network as an MLP
NUM_INPUTS = 100
HIDDEN_SIZE = 1024
NUM_OUTPUTS= 20
# MLP as a CNN
mlp = nn.Sequential(
    nn.Conv1d(NUM_INPUTS, HIDDEN_SIZE, 1),
    nn.Tanh(),
    nn.Conv1d(HIDDEN_SIZE, HIDDEN_SIZE, 1),
    nn.Tanh(),
    nn.Conv1d(HIDDEN_SIZE, NUM_OUTPUTS, 1),
    nn.LogSoftmax(dim=1)
)

x = torch.randn(1, 100, 1)
print('Input tensor x size: ', x.size())
y = mlp(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 100, 1])
# Output tensor y size:  torch.Size([1, 20, 1])
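The reason a stack of width-1 convolutions behaves like an MLP can be checked directly: a `Conv1d` with kernel size 1 computes the same dot products as an `nn.Linear` layer once the weights are shared. The sketch below copies the weights across; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(100, 20)
conv = nn.Conv1d(100, 20, kernel_size=1)

# Copy the fully connected weights into the convolution: [20, 100] -> [20, 100, 1]
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(2))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, 100)
y_fc = linear(x)                           # [1, 20]
y_conv = conv(x.unsqueeze(2)).squeeze(2)   # [1, 100, 1] -> [1, 20, 1] -> [1, 20]
print(torch.allclose(y_fc, y_conv, atol=1e-5))
```

The convolutional form has the practical advantage that it also accepts inputs with more than one time step and applies the same MLP at each position.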

Deconvolutional Neural Network

A deconvolutional neural network is a neural network that performs an inverse convolution model. Some experts refer to the work of a deconvolutional neural network as constructing layers from an image in an upward direction, while others describe deconvolutional models as “reverse engineering” the input parameters of a convolutional neural network model. Deconvolutional neural networks are also known as deconvolutional networks, deconvs or transposed convolutional neural networks.

The network architecture of a deconvolutional neural network is as below,

The step-by-step math explaining how a transposed convolution (deconvolution) performs 2x upsampling with a 3x3 filter and a stride of 2 is as below,

Some terms are:

n_W = the input dimension of the width, and n_H = the input dimension of the height. Since the tensor is square, a single parameter n represents both.
p = the padding applied to the tensor
f = the size of the square kernel
s = the stride that is applied with the convolution operation
c = the channel dimension of the tensor

The outcome we expect from the transposed convolution is described by the following equation:

n_out = s(n − 1) + f − 2p
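As a quick sanity check of the size arithmetic, the sketch below assumes the standard output-size relation s(n − 1) + f − 2p from the `torch.nn.ConvTranspose1d` documentation (dilation 1), plus its optional `output_padding` term. The helper name `deconv_out_size` and the padding choices in the 2-D case are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def deconv_out_size(n, f, s, p=0, output_padding=0):
    # s * (n - 1) + f - 2p, plus the optional output_padding term
    # documented for torch.nn.ConvTranspose1d/2d (dilation assumed to be 1)
    return s * (n - 1) + f - 2 * p + output_padding

# 1-D case: n = 2, f = 8, s = 4
y = nn.ConvTranspose1d(1, 1, 8, stride=4)(torch.randn(1, 1, 2))
print(y.size(2), deconv_out_size(2, 8, 4))

# 2x upsampling with a 3x3 filter and stride 2
# (padding=1 and output_padding=1 are assumed choices)
up = nn.ConvTranspose2d(1, 1, 3, stride=2, padding=1, output_padding=1)
z = up(torch.randn(1, 1, 4, 4))
print(tuple(z.shape[2:]), deconv_out_size(4, 3, 2, p=1, output_padding=1))
```

Without `output_padding`, a 3x3 filter with stride 2 and padding 1 would produce 7 instead of 8, which is why the extra term is needed for an exact doubling.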

#Deconvolutional Neural Network
NUM_CHANNELS_IN = 1
HIDDEN_SIZE = 1
KERNEL_WIDTH = 8
STRIDE = 4

deconv = nn.ConvTranspose1d(NUM_CHANNELS_IN,HIDDEN_SIZE,KERNEL_WIDTH,
                          stride=STRIDE)

SEQ_LEN = 2
y = torch.randn(1, NUM_CHANNELS_IN, SEQ_LEN)
print('Input tensor y size: ', y.size())
x = deconv(y)
print('Output (interpolated) tensor x size: ', x.size())
# output:
# Input tensor y size:  torch.Size([1, 1, 2])
# Output (interpolated) tensor x size:  torch.Size([1, 1, 12])

Quasi Recurrent Neural Network (QRNN)

Convolutional models for sequence processing have been more successful when combined with RNN layers in a hybrid architecture, because traditional max- and average-pooling approaches to combining convolutional features across timesteps assume time invariance and hence cannot make full use of large-scale sequence order information. QRNNs address both drawbacks of standard models: like CNNs, QRNNs allow parallel computation across both the timestep and minibatch dimensions, enabling high throughput and good scaling to long sequences. Like RNNs, QRNNs allow the output to depend on the overall order of elements in the sequence. The QRNN authors describe variants tailored to several natural language tasks, including document-level sentiment classification, language modeling, and character-level machine translation, and report that these models outperform strong LSTM baselines on all three tasks while dramatically reducing computation time.

The block diagrams below show the computation structure of the QRNN compared with typical LSTM and CNN architectures. Red signifies convolutions or matrix multiplications; a continuous block means that those computations can proceed in parallel.

If the pooling function requires a forget gate ft and an output gate ot at each timestep, the convolutional component computes a candidate sequence and the two gate sequences with three filter banks. When the filter width is 2, these equations reduce to the LSTM-like form,

z_t = tanh(W_z^1 x_{t−1} + W_z^2 x_t)
f_t = σ(W_f^1 x_{t−1} + W_f^2 x_t)
o_t = σ(W_o^1 x_{t−1} + W_o^2 x_t)

Suitable functions for the pooling subcomponent can be constructed from the familiar elementwise gates of the traditional LSTM cell.

The QRNN authors term the three options f-pooling, fo-pooling, and ifo-pooling respectively; in each case, h or c is initialized to zero.
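The three pooling options can be sketched as simple loops over time. This is a minimal sketch of my reading of the pooling recurrences, not the reference implementation; the function names `f_pool`, `fo_pool`, `ifo_pool` and the `[time, batch, hidden]` layout are assumptions.

```python
import torch

def f_pool(z, f, h0):
    # h_t = f_t * h_{t-1} + (1 - f_t) * z_t
    h, out = h0, []
    for t in range(z.size(0)):
        h = f[t] * h + (1 - f[t]) * z[t]
        out.append(h)
    return torch.stack(out)

def fo_pool(z, f, o, c0):
    # c_t = f_t * c_{t-1} + (1 - f_t) * z_t,  h_t = o_t * c_t
    c, out = c0, []
    for t in range(z.size(0)):
        c = f[t] * c + (1 - f[t]) * z[t]
        out.append(o[t] * c)
    return torch.stack(out)

def ifo_pool(z, f, i, o, c0):
    # c_t = f_t * c_{t-1} + i_t * z_t,  h_t = o_t * c_t
    c, out = c0, []
    for t in range(z.size(0)):
        c = f[t] * c + i[t] * z[t]
        out.append(o[t] * c)
    return torch.stack(out)

torch.manual_seed(0)
T, B, H = 5, 2, 3                       # time steps, batch, hidden size
z = torch.tanh(torch.randn(T, B, H))    # candidate vectors
f, i, o = (torch.sigmoid(torch.randn(T, B, H)) for _ in range(3))
zero = torch.zeros(B, H)                # h or c initialised to zero

h_f = f_pool(z, f, zero)
h_fo = fo_pool(z, f, o, zero)
h_ifo = ifo_pool(z, f, i, o, zero)
print(h_f.shape, h_fo.shape, h_ifo.shape)
```

Only these elementwise loops are sequential; the gates themselves come from convolutions that run in parallel over all time steps.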

The QRNN encoder-decoder architecture used for machine translation experiments is as below,

#Quasi Recurrent Neural Network
class fQRNNLayer(nn.Module):
  
  def __init__(self, num_inputs, num_outputs,
              kwidth=2):
    super().__init__()
    self.num_inputs = num_inputs
    self.num_outputs = num_outputs
    self.kwidth = kwidth
    # double feature maps for zt and ft predictions with same conv layer
    self.conv = nn.Conv1d(num_inputs, num_outputs * 2, kwidth)
    
  def forward(self, x, state=None):
    # x is [bsz, seq_len, num_inputs]
    # state is [bsz, num_outputs] dimensional
    # ---------- FEED FORWARD PART
    # inference convolutional part
    # transpose x axis first to work with CNN layer
    x = x.transpose(1, 2)
    pad = self.kwidth - 1
    xp = F.pad(x, (pad, 0))
    conv_h = self.conv(xp)
    # split convolutional layer feature maps into zt (new state
    # candidate) and forget activation ft
    zt, ft = torch.chunk(conv_h, 2, dim=1)
    # Convert forget gate into actual forget
    ft = torch.sigmoid(ft)
    # Convert zt into actual non-linear response
    zt = torch.tanh(zt)
    # ---------- SEQUENTIAL PART
    # iterate through time now to make pooling
    seqlen = ft.size(2)
    if state is None:
      # create the zero state
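      # (same dtype and device as the input so the layer also works on GPU)
      ht_1 = torch.zeros(ft.size(0), self.num_outputs, 1,
                         dtype=ft.dtype, device=ft.device)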
    else:
      # add the dim=2 to match 3D tensor shape
      ht_1 = state.unsqueeze(2)
    zts = torch.chunk(zt, zt.size(2), dim=2)
    fts = torch.chunk(ft, ft.size(2), dim=2)
    hts = []
    for t in range(seqlen):
      ht = ht_1 * fts[t] + (1 - fts[t]) * zts[t]
      # transpose time, channels dims again to match RNN-like shape
      hts.append(ht.transpose(1, 2))
      # re-assign h[t-1] now
      ht_1 = ht
    # convert hts list into a 3D tensor [bsz, seq_len, num_outputs]
    hts = torch.cat(hts, dim=1)
    return hts, ht_1.squeeze(2)
      
    
      
fqrnn = fQRNNLayer(1, 100, 2)
x = torch.randn(1, 10, 1)
ht, state = fqrnn(x)
print('ht size: ', ht.size())
print('state size: ', state.size())
# output:
# ht size:  torch.Size([1, 10, 100])
# state size:  torch.Size([1, 100])

AlexNet classifier

AlexNet is commonly described as integrating 5 convolution layers with ReLU (rectified linear unit) activations, 2 normalization layers, 3 pooling layers, 3 fully connected layers, and a final softmax classification layer ending in 1000 neurons for 1000 categories. The PyTorch implementation below omits the normalization layers.

AlexNet is one of the most representative CNN models, with three often-cited advantages: superior performance, fewer training parameters, and strong robustness. It is inspired by the principles of human vision, mimicking dual-channel visual transmission by learning image features over two channels.

The AlexNet model consists of two components: convolutional layers, which act as a feature extractor, and fully-connected layers, which act as a classifier. Improved variants are organized similarly; one improved AlexNet model contains seven convolutional layers, two fully-connected layers, and a softmax output layer.

The architectures of all the models are as below,

#AlexNet classifier
class AlexNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x

alexnet = AlexNet()
x = torch.randn(1, 3, 224, 224)
print('Input tensor x size: ', x.size())
y = alexnet(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 3, 224, 224])
# Output tensor y size:  torch.Size([1, 1000])
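The `256 * 6 * 6` input size of the classifier above can be traced by following the spatial size through the feature extractor. The helper `out_size` is a hypothetical name for the usual convolution and pooling size formula (dilation 1).

```python
import torch
import torch.nn as nn

def out_size(n, kernel, stride=1, padding=0):
    # spatial output size of a convolution or pooling layer
    return (n + 2 * padding - kernel) // stride + 1

n = 224
n = out_size(n, 11, 4, 2)   # conv1 -> 55
n = out_size(n, 3, 2)       # pool1 -> 27
n = out_size(n, 5, 1, 2)    # conv2 -> 27
n = out_size(n, 3, 2)       # pool2 -> 13
n = out_size(n, 3, 1, 1)    # conv3 -> 13 (conv4 and conv5 keep 13)
n = out_size(n, 3, 2)       # pool3 -> 6
print(n, 256 * n * n)

# Cross-check the first two layers against real modules
probe = nn.Sequential(nn.Conv2d(3, 64, 11, stride=4, padding=2),
                      nn.MaxPool2d(3, stride=2))
print(probe(torch.randn(1, 3, 224, 224)).shape)
```

For 224x224 inputs the adaptive average pooling layer is therefore a no-op; it only matters when the input resolution differs.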

Time-Delayed Neural Network (TDNN)

TDNNs are feedforward neural networks in which each input weight has a delay element associated with it. Time-series data are typically used as the input, and the finite responses of the network can be captured. Accordingly, a TDNN can be considered an ANN architecture whose main purpose is to work on sequential data. TDNN units detect features independently of time shift and usually form part of a larger pattern recognition system. A TDNN has multiple layers and sufficient interconnection between units in each layer to ensure the ability to learn complex nonlinear decision surfaces. In addition, the abstraction learned by the TDNN should be invariant under translation in time.

The goal of a TDNN is to find the function H that relates the input to the output of the network. This is given by,

where netj and y(n) are the functions at the input and output layers, respectively. The architecture of a time-delay neural network is as below,

#Time-Delayed Neural Network (TDNN)
class StatisticalPooling(nn.Module):

    def forward(self, x): 
        # x is 3-D with axis [B, feats, T]
        mu = x.mean(dim=2, keepdim=True)
        std = x.std(dim=2, keepdim=True)
        return torch.cat((mu, std), dim=1)

class TDNN(nn.Module):
    # Architecture taken from x-vectors extractor
    # https://www.danielpovey.com/files/2018_icassp_xvectors.pdf
    def __init__(self, num_inputs=24, num_outputs=2000):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv1d(num_inputs, 512, 5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 512, 3, dilation=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 512, 3, dilation=3, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 512, 1), 
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 1500, 1), 
            nn.ReLU(inplace=True),
            StatisticalPooling(),
            nn.Conv1d(3000, 512, 1), 
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 512, 1), 
            nn.ReLU(inplace=True),
            nn.Conv1d(512, num_outputs, 1), 
            nn.LogSoftmax(dim=1)
        )   

    def forward(self, x): 
        return self.model(x)

tdnn = TDNN()
x = torch.randn(1, 24, 10000)
print('Input tensor x size: ', x.size())
# The output contains the final pooling through time with
# 2000 class activations, so it is [batch_size, num_classes, 1];
# the trailing 1 is the single time-step left after pooling
y = tdnn(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 24, 10000])
# Output tensor y size:  torch.Size([1, 2000, 1])
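The delay elements of a TDNN show up in the code above as dilated 1-D convolutions. As a rough illustration, the sketch below computes how many input frames one output frame can see after the first three layers and checks it empirically with gradients. The narrow 4-channel stack without activations is an assumption made to keep the check small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# (kernel, dilation) of the three context layers in the TDNN above
layers = [(5, 1), (3, 2), (3, 3)]
receptive_field = 1 + sum((k - 1) * d for k, d in layers)
print(receptive_field)

# Empirical check on a narrow, activation-free stack (4 channels is an
# arbitrary choice): count the input frames that influence one output frame
net = nn.Sequential(
    nn.Conv1d(1, 4, 5, padding=2),
    nn.Conv1d(4, 4, 3, dilation=2, padding=2),
    nn.Conv1d(4, 4, 3, dilation=3, padding=3),
)
x = torch.randn(1, 1, 101, requires_grad=True)
net(x)[0, 0, 50].backward()
touched = int((x.grad[0, 0] != 0).sum())
print(touched)
```

The layers with kernel width 1 that follow do not widen this context; the statistical pooling layer then summarizes all frames at once.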

Residual Network

Residual blocks are basically a special case of highway networks without any gates in their skip connections. Essentially, residual blocks allow the flow of memory (or information) from the initial layers to the last layers. Despite the absence of gates in their skip connections, residual networks perform as well as highway networks in practice. Stacking residual blocks one after another yields a complete ResNet.

The following architecture represents ResNet,

A residual block with identity mapping can be represented by the following formula:

x_{l+1} = x_l + F(x_l, W_l)

where x_l and x_{l+1} are the input and output of the l-th unit in the network, F is a residual function, and W_l are the parameters of the block. A residual network consists of sequentially stacked residual blocks.

Residual networks consist of two types of blocks:

i. basic — with two consecutive 3 × 3 convolutions with batch normalization and ReLU preceding convolution: conv3×3-conv3×3.

ii. bottleneck — with one 3 × 3 convolutions surrounded by dimensionality reducing and expanding 1×1 convolution layers: conv1×1-conv3×3-conv1×1
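A bottleneck block of the second type can be sketched as follows. This is a hypothetical illustration, not a reference implementation: the class name `BottleneckBlock`, the reduction factor of 4, and the placement of batch normalization and ReLU after each convolution are all assumptions.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # Hypothetical sketch: conv1x1 (reduce) - conv3x3 - conv1x1 (expand)
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.out_relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity skip connection added to the residual branch
        return self.out_relu(x + self.body(x))

block = BottleneckBlock(64)
x = torch.randn(2, 64, 16, 16)
y = block(x)
print(y.size())
```

The 3x3 convolution runs on a quarter of the channels here, which is what makes the bottleneck cheaper than two full-width 3x3 convolutions.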

The structure of wide residual networks, where the network width is determined by a factor k, is as below,

#Residual connections
class ResLayer(nn.Module):
  
  def __init__(self, num_inputs):
    super().__init__()
    self.num_inputs = num_inputs
    num_outputs = num_inputs
    self.num_outputs = num_outputs
    self.conv1 = nn.Sequential(
        nn.Conv2d(num_inputs, num_outputs, 3, padding=1),
        nn.BatchNorm2d(num_outputs),
        nn.ReLU(inplace=True)
    )
    self.conv2 = nn.Sequential(
        nn.Conv2d(num_outputs, num_outputs, 3, padding=1),
        nn.BatchNorm2d(num_outputs),
        nn.ReLU(inplace=True)
    )
    self.out_relu = nn.ReLU(inplace=True)
    
  def forward(self, x):
    # non-linear processing trunk
    conv1_h = self.conv1(x)
    conv2_h = self.conv2(conv1_h)
    # output is result of res connection + non-linear processing
    y = self.out_relu(x + conv2_h)
    return y
    
x = torch.randn(1, 64, 100, 100)
print('Input tensor x size: ', x.size())
reslayer = ResLayer(64)
y = reslayer(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 64, 100, 100])
# Output tensor y size:  torch.Size([1, 64, 100, 100])

Autoencoder Network

“Autoencoding” is a data compression algorithm where the compression and decompression functions are 1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human. Additionally, in almost all contexts where the term “autoencoder” is used, the compression and decompression functions are implemented with neural networks.

Autoencoders are a specific type of feedforward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.

An autoencoder consists of 3 components: encoder, code, and decoder. The encoder compresses the input and produces the code, the decoder then reconstructs the input only using this code.

The following architecture pre-trains a dual autoencoder to embed the inputs into a latent space; the reconstruction results are obtained from the latent representations and their noisy versions based on the noisy-transformer.

For a deep autoencoder applied to unmixing, the Sum of Absolute Differences (SAD), Relative Error (RE), and root mean square error (RMSE) are used to measure the accuracy of the unmixing results, which are defined as follows:

Here ŵi and wi denote the extracted endmember and the library spectrum, ŷj and yj are the reconstruction and the original signature of pixel j, and ĥj and hj are the corresponding estimated and actual abundance fractions, respectively.

For more details read: https://www.umbc.edu/rssipl/people/aplaza/Papers/Journals/2019.TGRS.DAEN.pdf

or, http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

#Auto-Encoder Network
class AE(nn.Module):
    def __init__(self, num_inputs=784):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_inputs, 400),
            nn.ReLU(inplace=True),
            nn.Linear(400, 400),
            nn.ReLU(inplace=True),
            nn.Linear(400, 20)
        )
        self.decoder = nn.Sequential(
            nn.Linear(20, 400),
            nn.ReLU(inplace=True),
            nn.Linear(400, 400),
            nn.ReLU(inplace=True),
            nn.Linear(400, num_inputs)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

      
ae = AE(784)
x = torch.randn(10, 784)
print('Input tensor x size: ', x.size())
y = ae(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([10, 784])
# Output tensor y size:  torch.Size([10, 784])
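Since the target of an autoencoder is its own input, training it only needs a reconstruction loss. The sketch below shows a few optimization steps on a random stand-in batch; the smaller layer sizes, the Adam optimizer, and the MSE loss are illustrative assumptions, not choices made by the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small stand-in encoder/decoder (sizes are illustrative assumptions)
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 20))
decoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 784))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(32, 784)     # stand-in batch; real data would come from a DataLoader
losses = []
for step in range(50):
    optimizer.zero_grad()
    code = encoder(x)                       # compress to the 20-dim code
    loss = criterion(decoder(code), x)      # the target is the input itself
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```

On random noise there is little structure to compress, so the loss drops only modestly; on real images the 20-dimensional code captures much more.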

Variational Auto-Encoder Network

Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation. VAEs inherit the architecture of traditional autoencoders and use this to learn a data generating distribution, which allows us to take random samples from the latent space. These random samples can then be decoded using the decoder network to generate unique images that have similar characteristics to those that the network was trained on.

A schematic of computational flow in a variational autoencoder is as below,

A variational autoencoder implemented for training as a feedforward neural network, where P(X|z) is Gaussian, is as below,

In general, variational inference is posed as the problem of finding a model distribution q(z) to approximate the true posterior p(z|y),

q(z) = argmin_q KL(q(z) || p(z|y))

where the Kullback-Leibler divergence KL is a non-symmetric measure of dissimilarity defined on probability distributions. The Kullback-Leibler divergence can then be rewritten to obtain a lower bound on the intractable marginal likelihood p(y). The Kullback-Leibler divergence is formally defined as:

KL(q(z) || p(z|y)) = E_{q(z)}[ln q(z) − ln p(z|y)]

For a more mathematical explanation of variational autoencoders, please read the following blog,

https://davidstutz.de/the-mathematics-of-variational-auto-encoders/

#Variational Auto-Encoder Network
# from https://github.com/pytorch/examples/blob/master/vae/main.py
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20)
        self.fc22 = nn.Linear(400, 20)
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        return mu + eps*std

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

vae = VAE()
x = torch.randn(10, 784)
print('Input tensor x size: ', x.size())
y, mu, logvar = vae(x)
print('Output tensor y size: ', y.size())
print('Mean tensor mu size: ', mu.size())
print('Log-variance tensor logvar size: ', logvar.size())
# output:
# Input tensor x size:  torch.Size([10, 784])
# Output tensor y size:  torch.Size([10, 784])
# Mean tensor mu size:  torch.Size([10, 20])
# Log-variance tensor logvar size:  torch.Size([10, 20])
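The three values returned by the VAE are what a training loss needs: a reconstruction term plus a KL term that pulls the latent distribution toward a standard normal. The sketch below follows the common closed-form loss for a diagonal Gaussian encoder; the function name `vae_loss` and the random stand-in tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def vae_loss(recon_x, x, mu, logvar):
    # reconstruction term: binary cross-entropy, summed over pixels and batch
    bce = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld, kld

torch.manual_seed(0)
x = torch.rand(10, 784)                         # targets must lie in [0, 1] for BCE
recon_x = torch.sigmoid(torch.randn(10, 784))   # stand-in for the decoder output
mu, logvar = torch.randn(10, 20), torch.randn(10, 20)

loss, kld = vae_loss(recon_x, x, mu, logvar)

# cross-check the closed form against torch.distributions
q = Normal(mu, torch.exp(0.5 * logvar))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
ref = kl_divergence(q, p).sum()
print(float(loss), torch.allclose(kld, ref, rtol=1e-3))
```

Note that binary cross-entropy expects inputs in [0, 1], so the `torch.randn` test input used above is fine for checking shapes but not for computing this loss.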

Deep Convolutional Auto-Encoder with skip connections

A deep autoencoder architecture with skip connections converges much faster in training and attains a higher-quality local optimum. It consists of convolutional and deconvolutional layers for image restoration. The convolutional layers act as the feature extractor, encoding the primary components of the image content while eliminating the corruption. The deconvolutional layers then decode the image abstraction to recover the details of the image content. Skip connections between corresponding convolutional and deconvolutional layers help to back-propagate the gradients to the bottom layers and pass image details to the top layers, which improves accuracy. A benefit of this model is that the skip connections have element-wise correspondence, which can be very important in pixel-wise prediction problems such as image denoising.

A convolutional auto-encoder in combination with a skip-connection is as below,

Mathematically, the output of the i-th layer of this architecture is obtained as follows:

Here Fc and Fd denote the convolution and deconvolution operations in each layer, without the ReLU activation function. It is easy to observe that the skip connections indicate identity mapping. The output of the network is:

XL can be computed recursively, and more specifically, as follows,

For more detail please read: https://arxiv.org/pdf/1606.08921.pdf

#Deep Convolutional Auto-Encoder with skip connections (SEGAN G)
class DownConv1dBlock(nn.Module):
  
  def __init__(self, ninp, fmap, kwidth, stride):
    super().__init__()
    assert stride > 1, stride
    self.kwidth = kwidth
    self.conv = nn.Conv1d(ninp, fmap, kwidth, stride=stride)
    self.act = nn.ReLU(inplace=True)
  
  def forward(self, x):
    # calculate padding with stride > 1
    pad_left = self.kwidth // 2 - 1
    pad_right = self.kwidth // 2
    xp = F.pad(x, (pad_left, pad_right))
    y = self.act(self.conv(xp))
    return y

block = DownConv1dBlock(1, 1, 31, 4)
x = torch.randn(1, 1, 4000)
print('Input tensor x size: ', x.size())
y = block(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 1, 4000])
# Output tensor y size:  torch.Size([1, 1, 1000])

# Transposed convolutional (upsampling) layer
class UpConv1dBlock(nn.Module):
  
  def __init__(self, ninp, fmap, kwidth, stride, act=True):
    super().__init__()
    assert stride > 1, stride
    self.kwidth = kwidth
    pad = max(0, (stride - kwidth) // -2)
    self.deconv = nn.ConvTranspose1d(ninp, fmap, kwidth,
                                    stride=stride,
                                    padding=pad)
    if act:
      self.act = nn.ReLU(inplace=True)
  
  def forward(self, x):
    h = self.deconv(x)
    if self.kwidth % 2 != 0:
      # drop last item for shape compatibility with TensorFlow deconvs
      h = h[:, :, :-1]
    if hasattr(self, 'act'):
      y = self.act(h)
    else:
      y = h
    return y

block = UpConv1dBlock(1, 1, 31, 4)
x = torch.randn(1, 1, 1000)
print('Input tensor x size: ', x.size())
y = block(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 1, 1000])
# Output tensor y size:  torch.Size([1, 1, 4000])

class Conv1dGenerator(nn.Module):
  
  def __init__(self, enc_fmaps=[64, 128, 256, 512], kwidth=31,
               pooling=4):
    super().__init__()
    self.enc = nn.ModuleList()
    ninp = 1
    for enc_fmap in enc_fmaps:
      self.enc.append(DownConv1dBlock(ninp, enc_fmap, kwidth, pooling))
      ninp = enc_fmap
    
    self.dec = nn.ModuleList()
    # revert encoder feature maps
    dec_fmaps = enc_fmaps[::-1][1:] + [1]
    act = True
    for di, dec_fmap in enumerate(dec_fmaps, start=1):
      if di >= len(dec_fmaps):
        # last decoder layer has no activation
        act = False
      self.dec.append(UpConv1dBlock(ninp, dec_fmap, kwidth, pooling, act=act))
      ninp = dec_fmap
  
  def forward(self, x):
    skips = []
    h = x
    for ei, enc_layer in enumerate(self.enc, start=1):
      h = enc_layer(h)
      if ei < len(self.enc):
        skips.append(h)
    # now decode
    
    for di, dec_layer in enumerate(self.dec, start=1):
      if di > 1:
        # sum skip connection
        skip_h = skips.pop(-1)
        h = h + skip_h
      h = dec_layer(h)
    y = h
    return y
      
G = Conv1dGenerator()
x = torch.randn(1, 1, 8192)
print('Input tensor x size: ', x.size())
y = G(x)
print('Output tensor y size: ', y.size())
# output:
# Input tensor x size:  torch.Size([1, 1, 8192])
# Output tensor y size:  torch.Size([1, 1, 8192])

Deep Convolutional Generative Adversarial Network (DCGAN)

The Generative Adversarial Network (GAN) is a generative model that lets us generate a whole image in parallel. The architecture consists of two models: a Generator (a deep network that generates realistic images) and a Discriminator (a deep network that distinguishes real images from computer-generated ones). Originally both models were implemented as Multilayer Perceptrons (MLP); more recently they are implemented as deep convolutional neural networks.

The basic working of a GAN is shown below:

DCGANs are very similar to GANs but specifically focus on using deep convolutional networks in place of the fully-connected networks used in vanilla GANs.
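To make the convolutional upsampling concrete, the spatial size after each transposed convolution can be worked out by hand. The sketch below assumes the layer settings used by the generator code in this section (kernel 4, then stride 2 and padding 1 after the first layer) and the standard output-size formula for a transposed convolution with no dilation and no output padding. The helper name `conv_transpose_out` is made up for illustration.

```python
def conv_transpose_out(size, kernel, stride, padding):
    # Output length of a transposed convolution
    # (dilation 1, output_padding 0 assumed).
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) of the five generator layers
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1          # the latent vector z has spatial size 1 x 1
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

Each stride-2 layer doubles the feature map, which is how a 1 x 1 noise vector grows into a 64 x 64 image.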

The architecture of the DCGAN generator is shown below:

The GAN algorithm is summarized in the figure below:

#DCGAN G and D
# from https://github.com/pytorch/examples/blob/master/dcgan/main.py
class Generator(nn.Module):
    def __init__(self, nc=3):
        super().__init__()
        nz = 100
        ngf = 64
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2,     ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(    ngf,      nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
      return self.main(input)

z = torch.randn(1, 100, 1, 1)
print('Input tensor z size: ', z.size())
G = Generator()
x = G(z)
print('Output tensor x size: ', x.size())

#output:
Input tensor z size:  torch.Size([1, 100, 1, 1])
Output tensor x size:  torch.Size([1, 3, 64, 64])

class Discriminator(nn.Module):
    def __init__(self, nc=3):
        super(Discriminator, self).__init__()
        ndf = 64
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
      
      
x = torch.randn(1, 3, 64, 64)
print('Input tensor x size: ', x.size())
D = Discriminator()
y = D(x)
print('Output tensor y size: ', y.size())

#output:
Input tensor x size:  torch.Size([1, 3, 64, 64])
Output tensor y size:  torch.Size([1, 1, 1, 1])
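The two networks are only shape-checked here, so a rough sketch of how they are usually trained against each other may help. The example below is not the author's training code: it uses tiny fully-connected stand-ins for G and D, random numbers in place of real images, and hypothetical sizes (`nz`, `data_dim`, `batch`) so that it runs quickly on a CPU. The Adam settings are also an assumption. It shows one adversarial step with binary cross-entropy: first update the discriminator on real and fake samples, then update the generator so its fakes are scored as real.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

nz, data_dim, batch = 8, 16, 4  # hypothetical toy sizes

# Tiny stand-in networks (assumption: not the DCGAN models defined above)
G = nn.Sequential(nn.Linear(nz, 32), nn.ReLU(),
                  nn.Linear(32, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2),
                  nn.Linear(32, 1), nn.Sigmoid())

criterion = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

real = torch.rand(batch, data_dim) * 2 - 1  # placeholder "real" batch in [-1, 1]
ones = torch.ones(batch, 1)                 # label for real samples
zeros = torch.zeros(batch, 1)               # label for fake samples

# 1) Discriminator step: real -> 1, fake -> 0.
#    fake.detach() keeps this loss from updating the generator.
opt_d.zero_grad()
fake = G(torch.randn(batch, nz))
loss_d = criterion(D(real), ones) + criterion(D(fake.detach()), zeros)
loss_d.backward()
opt_d.step()

# 2) Generator step: try to make D label the fakes as real.
opt_g.zero_grad()
loss_g = criterion(D(fake), ones)
loss_g.backward()
opt_g.step()

print("D loss:", loss_d.item(), "G loss:", loss_g.item())
```

In a real run these two steps would be repeated over many batches, with the Generator and Discriminator classes above in place of the stand-ins.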

Training on a GPU with Pytorch

A multi-layer perceptron consists of several single-layer perceptrons arranged in a hierarchy. The hidden layers of a neural network simply add more neurons between the input and output layers. A multi-layer perceptron neural network (MLPNN) is used in this example. The MLPNN is a neural network for performing varied detection and classification tasks. Its features include the ability to learn and generalize, a smaller training data requirement, fast operation, and easy implementation.

An example of a multi-layer perceptron neural network with one hidden layer is shown below:

Data in the input layer is labeled x with subscripts 1, 2, 3, …, n. Neurons in the hidden layer are labeled h with subscripts 1, 2, 3, …, n. The figure above shows how input values are forward propagated into the hidden layer, and then from the hidden layer to the output of the MLPNN.

In mathematical terms, for the input vector X (the set of values representing the problem domain) and the weight vector W (the set of weights describing how important each problem domain value is), the weighted sum is:

$$w_0 + \sum_{i=1}^{n} w_i x_i$$

where n is the dimension of the input vector X, w0 is the bias and wi is the i-th weight.
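As a quick check of the weighted sum, the sketch below computes it by hand for a single neuron and compares it with `nn.Linear`, which performs the same operation. The numeric values of `x`, `w` and `w0` are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])    # input vector X, n = 3
w = torch.tensor([0.5, -1.0, 2.0])   # weights w_1 .. w_n
w0 = torch.tensor(0.1)               # bias w_0

# Weighted sum written out: w_0 + sum_i w_i * x_i
z_manual = w0 + (w * x).sum()

# The same computation with a one-neuron linear layer
layer = torch.nn.Linear(3, 1)
with torch.no_grad():
    layer.weight.copy_(w.unsqueeze(0))  # weight has shape (1, 3)
    layer.bias.copy_(w0.unsqueeze(0))   # bias has shape (1,)
z_layer = layer(x)

print(z_manual.item(), z_layer.item())  # both approximately 4.6
```

A hidden layer with several neurons is the same computation repeated with one weight row per neuron.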

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device. The selected device can be changed with a torch.cuda.device context manager. Consider random data points as below and run the model.
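Before the dataset, here is a minimal sketch of device selection. It assumes nothing beyond a standard PyTorch install and falls back to the CPU when no GPU is present, so it runs either way. The `torch.cuda.device` context manager is only exercised inside the `is_available()` guard.

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)  # created directly on the device
b = torch.ones(2, 3).to(device)      # created on the CPU, then moved
c = a + b                            # both operands live on the same device
print(c.device, c.sum().item())

if torch.cuda.is_available():
    print("Current device index:", torch.cuda.current_device())
    # Temporarily select GPU 0; CUDA tensors created inside default to it
    with torch.cuda.device(0):
        d = torch.zeros(1, device="cuda")
        print(d.device)
```

Operands of an operation must live on the same device, which is why both the data and the network are moved together in the training routine further down.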

Dataset: 10000 random sample points.

# import
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
%matplotlib inline

#Method where n: Inputs
def sample_points(n):
    '''Return (X, Y), where X of shape (n, 2) is the numpy array of points and Y is the (n) array of classes'''
    radius = np.random.uniform(low=0,high=2,size=n).reshape(-1,1) # uniform radius between 0 and 2
    angle = np.random.uniform(low=0,high=2*np.pi,size=n).reshape(-1,1) # uniform angle
    x1 = radius*np.cos(angle)
    x2=radius*np.sin(angle)
    y = (radius<1).astype(int).reshape(-1)
    x = np.concatenate([x1,x2],axis=1)
    return x,y

# Generate the data
trainx,trainy = sample_points(10000)
valx,valy = sample_points(500)
testx,testy = sample_points(500)
print(trainx.shape,trainy.shape)

#output: (10000, 2) (10000,)

#Method to generate single hidden MLP
def generate_single_hidden_MLP(n_hidden_neurons):
    return nn.Sequential(nn.Linear(2,n_hidden_neurons),nn.ReLU(),nn.Linear(n_hidden_neurons,2))

model1 = generate_single_hidden_MLP(6)

#Conversion
trainx = torch.from_numpy(trainx).float()
valx = torch.from_numpy(valx).float()
testx = torch.from_numpy(testx).float()
trainy = torch.from_numpy(trainy)
valy = torch.from_numpy(valy)
testy = torch.from_numpy(testy)
print(trainx.type(),trainy.type())

#output:
torch.FloatTensor torch.IntTensor

def training_routine(net,dataset,n_iters,gpu):
    # organize the data
    train_data,train_labels,val_data,val_labels = dataset
    
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(),lr=0.01)
    
    # use the flag
    train_data,train_labels = train_data,train_labels.long()
    val_data,val_labels = val_data,val_labels.long()
    if gpu:
        train_data,train_labels = train_data.cuda(),train_labels.cuda()
        val_data,val_labels = val_data.cuda(),val_labels.cuda()
        net = net.cuda() # the network parameters also need to be on the gpu !
        print("Using GPU")
    else:
        train_data,train_labels = train_data.cpu(),train_labels.cpu()
        val_data,val_labels = val_data.cpu(),val_labels.cpu()
        net = net.cpu() # the network parameters also need to be on the cpu !
        print("Using CPU")
    for i in range(n_iters):
        # forward pass
        train_output = net(train_data)
        train_loss = criterion(train_output,train_labels)
        # backward pass and optimization
        train_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        # Once every 100 iterations, print statistics
        if i%100==0:
            print("At iteration",i)
            # compute the accuracy of the prediction
            train_prediction = train_output.cpu().detach().argmax(dim=1)
            train_accuracy = (train_prediction.cpu().numpy()==train_labels.cpu().numpy()).mean() 
            # Now for the validation set
            val_output = net(val_data)
            val_loss = criterion(val_output,val_labels)
            # compute the accuracy of the prediction
            val_prediction = val_output.cpu().detach().argmax(dim=1)
            val_accuracy = (val_prediction.cpu().numpy()==val_labels.cpu().numpy()).mean() 
            print("Training loss :",train_loss.cpu().detach().numpy())
            print("Training accuracy :",train_accuracy)
            print("Validation loss :",val_loss.cpu().detach().numpy())
            print("Validation accuracy :",val_accuracy)
    
    net = net.cpu()

dataset = trainx,trainy,valx,valy
gpu = True
gpu = gpu and torch.cuda.is_available() # to know if you actually can use the GPU

training_routine(model1,dataset,10000,gpu)

#output:
Using GPU
At iteration 0
Training loss : 0.7118081
Training accuracy : 0.5684
Validation loss : 0.71091545
Validation accuracy : 0.564
...
Validation accuracy : 0.978

# Let's try with 3 hidden neurons.
model2 = generate_single_hidden_MLP(3)
training_routine(model2,dataset,10000,gpu)

#output:
Using GPU
At iteration 0
Training loss : 0.76269054
...
Validation accuracy : 0.95

out = model2(testx).argmax(dim=1).detach().numpy()
green = testx.numpy()[np.where(out==1)]
red = testx.numpy()[np.where(out==0)]
print(green.shape,red.shape)

#output: (240, 2) (260, 2)

#Method to print Model with datapoints
def print_model(model,datapoints):
    out = model(datapoints).argmax(dim=1).detach().numpy()
    green = datapoints.numpy()[np.where(out==1)]
    red = datapoints.numpy()[np.where(out==0)]

    circle1 = plt.Circle((0, 0), 1, color='r')
    circle2 = plt.Circle((0, 0), 1, color='y',fill=False)

    fig, ax = plt.subplots() # note we must use plt.subplots, not plt.subplot
    # (or if you have an existing figure)
    # fig = plt.gcf()
    # ax = fig.gca()
    plt.xlim((-2,2))
    plt.ylim((-2,2))

    pos_values = plt.scatter(x=green[:,0],y=green[:,1], color='g',)
    neg_values = plt.scatter(x=red[:,0],y=red[:,1], color='b',)

    ax.add_artist(circle1)
    ax.add_artist(circle2)
    ax.add_artist(pos_values)
    ax.add_artist(neg_values)

print_model(model1,testx)

Output: the MLPNN model with six hidden neurons, trained on the GTX 1660 Ti GPU in the CUDA environment.

Output: the MLPNN model with three hidden neurons, trained on the GTX 1660 Ti GPU in the CUDA environment.

print_model(model2,testx)

model3 = generate_single_hidden_MLP(2) 
training_routine(model3,dataset,10000,gpu)

#output:
Using GPU
At iteration 0
Training loss : 0.74870116
...
Validation accuracy : 0.64

Output: the MLPNN model with two hidden neurons, trained on the GTX 1660 Ti GPU in the CUDA environment.

print_model(model3,testx)

Data-Parallelism on the GPU with CUDA

Here a linear model (a single nn.Linear layer) is run with CUDA data parallelism on a GeForce GTX 1660 Ti graphics card (compute capability 7.5) to achieve time-efficient computation on a RandomDataset.

CUDA is a parallel computing platform and programming model designed by NVIDIA for its GPU hardware. In CUDA, code that runs directly on the GPU is called a kernel. The host CPU invokes the kernel, specifying the number of parallel threads that execute it. The host code also specifies the execution configuration, which is a 1D, 2D, or 3D grid structure onto which the parallel threads are mapped.

The basic structure of parallelization across devices is shown in the figure below:

The flowchart of the GPU toolchain is shown in the figure below:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 0:
  print("We are using", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

#output:
We are using 1 GPUs!  #We can use multiple also
DataParallel(
  (module): Model(
    (fc): Linear(in_features=5, out_features=2, bias=True)
  )
)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())

#output:
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
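With a single GPU the model sees the whole batch, so the split along dimension 0 never shows up in the output above. The sketch below imitates that split on the CPU with `torch.chunk`, assuming a hypothetical machine with 3 GPUs. It illustrates the idea in the `[30, xxx] -> [10, ...], [10, ...], [10, ...]` comment rather than calling DataParallel's internals. It also reproduces the batch sizes the DataLoader yields for 100 samples with a batch size of 30.

```python
import torch

batch = torch.randn(30, 5)  # one batch: 30 samples, 5 features
n_gpus = 3                  # hypothetical number of GPUs

# Split the batch along dim 0, one chunk per (imaginary) GPU
chunks = torch.chunk(batch, n_gpus, dim=0)
chunk_shapes = [tuple(c.shape) for c in chunks]
print(chunk_shapes)  # [(10, 5), (10, 5), (10, 5)]

# Concatenating the per-device pieces along dim 0 restores the full batch,
# as happens with the per-device outputs after the forward pass
restored = torch.cat(chunks, dim=0)
print(tuple(restored.shape))  # (30, 5)

# Batch sizes yielded by the DataLoader: 100 samples, batch size 30
data_size, batch_size = 100, 30
batch_sizes = [min(batch_size, data_size - start)
               for start in range(0, data_size, batch_size)]
print(batch_sizes)  # [30, 30, 30, 10]
```

The last batch holds only the remaining 10 samples, which matches the final `[10, 5]` lines in the output.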

Summary

In this blog, we have covered the basics of the PyTorch deep learning framework, with implementations of several deep learning algorithms and their mathematical expressions. We have reviewed the basic architectures (the most fundamental layers), how they relate to each other, and their implementations. We have also reviewed some advanced architectures built on top of the basic ones, such as the Auto-Encoder Neural Network, Variational Auto-Encoder, AlexNet, Time delay neural network, Inception module and more. Finally, we have worked through examples of data parallelism on a random dataset and of training on the GPU with CUDA support.

Thank you for reading this blog.

Code is available on my GitHub page: github