Lviv University
LLM challenges
BitNet paper
BitNet: Scaling 1-bit Transformers for Large Language Models (2023 paper)
One solution - Quantization
In neural network quantization, the weights and activation tensors are stored in lower bit precision than the 16 or 32-bit precision they are usually trained in.
A floating-point vector \(\boldsymbol{x}\) can be expressed approximately as a scalar multiplied by a vector of integer values: \[ \hat{\boldsymbol{x}} = s_{\boldsymbol{x}} \cdot \boldsymbol{x}_{int} \approx \boldsymbol{x} \]
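As a minimal illustration of this form (the function names are ours, not from any particular library), symmetric 8-bit quantization can be sketched in PyTorch as follows:

```python
import torch

def quantize_int8(x: torch.Tensor, eps: float = 1e-8):
    """Symmetric per-tensor quantization: pick a scale s_x so that x ≈ s_x * x_int."""
    s_x = x.abs().max() / 127.0 + eps                    # largest magnitude maps to ±127
    x_int = torch.round(x / s_x).clamp(-127, 127).to(torch.int8)
    return s_x, x_int

def dequantize(s_x, x_int):
    """Reconstruct the approximation x_hat = s_x * x_int."""
    return s_x * x_int.float()

x = torch.randn(8)
s_x, x_int = quantize_int8(x)
print((x - dequantize(s_x, x_int)).abs().max())          # small quantization error
```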

Hardware accelerator types
Groq LPU die
Binarization
An extreme case of quantization, applied to large language models.
Previously applied to CNNs and Transformers.
Description
BitNet architecture is a Transformer that replaces nn.Linear with BitLinear.
BitNet trains 1-bit Transformers from scratch, obtaining competitive results in an energy-efficient way.
Left: The computation flow of BitLinear. Right: The architecture of BitNet, consisting of the stacks of attentions and FFNs, where matrix multiplication is implemented as BitLinear.
BitLinear: Binarization
\[ \tilde{W} = sign\left(W-\alpha\right),\\ sign(W_{ij}) = \begin{cases} +1, \; W_{ij} > 0,\\ -1, \; W_{ij} \leq 0\end{cases},\\ \alpha=\frac{1}{nm} \sum\limits_{ij}W_{ij},\\ W \in \mathbb{R}^{n \times m} \]
BitLinear: Quantization
Scale activations into the \([-Q_b, Q_b]\) range: \[ \tilde{x} = Quant(x) = Clip(x \times \frac{Q_b}{\gamma}, -Q_b+\epsilon, Q_b - \epsilon),\\ Clip(x,a,b) = \max(a, \min(b,x)), \; \gamma = \|x\|_{\infty}, \; Q_b = 2^{b-1}. \]
BitLinear: Formulation
\[ y = \tilde{W} \tilde{x} = \tilde{W} \times Quant(LN(x)) \times \frac{\beta \gamma}{Q_b},\\ LN(x) = \frac{x-E(x)}{\sqrt{Var(x) + \epsilon}}, \; \beta = \frac{1}{nm}\|W\|_1. \]
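Putting the binarization, activation quantization, and rescaling steps together, a simplified BitLinear forward pass could look like the sketch below (the class name and initialisation are illustrative; the straight-through estimator and group quantization used in training are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """A simplified BitLinear forward pass following the formulas above."""
    def __init__(self, in_features, out_features, b=8, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.Qb = 2 ** (b - 1)
        self.eps = eps

    def forward(self, x):
        W = self.weight
        # Binarize weights: W_tilde = sign(W - alpha), with alpha = mean(W); zeros map to -1
        alpha = W.mean()
        W_tilde = torch.sign(W - alpha)
        W_tilde = torch.where(W_tilde == 0, -torch.ones_like(W_tilde), W_tilde)
        beta = W.abs().mean()                              # beta = ||W||_1 / (n*m)
        # Quantize activations after LayerNorm into [-Qb, Qb]
        x_ln = F.layer_norm(x, x.shape[-1:], eps=self.eps)
        gamma = x_ln.abs().max()                           # gamma = ||x||_inf
        x_q = torch.clamp(x_ln * self.Qb / (gamma + self.eps),
                          -self.Qb + self.eps, self.Qb - self.eps)
        # y = (W_tilde x_q) * beta * gamma / Qb
        return F.linear(x_q, W_tilde) * beta * gamma / self.Qb
```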
Model parallelism with Group Quantization and Normalization
Partitioning matrix multiplication on multiple devices.
Tensor parallelism
Problem
A prerequisite for the existing model parallelism approaches is that the tensors are independent along the partition dimension. However, all of the parameters \(\alpha\), \(\beta\), \(\gamma\), and \(\eta\) are calculated from the whole tensors, breaking this independence prerequisite.
Solution
Divide the weights and activations into groups and then independently estimate each group’s parameters. This way, the parameters can be calculated locally without requiring additional communication. This approach is called Group Quantization.
Group Quantization
Divide the weight matrix \(W\) into \(G\) groups, and estimate parameters independently: \[ \alpha_g = \frac{G}{nm} \sum\limits_{ij} W_{ij}^{(g)}, \\ \beta_g = \frac{G}{nm} \|W^{(g)}\|_1,\\ \gamma_g = \|x^{(g)}\|_{\infty},\\ \eta_g = \min\limits_{ij} x_{ij}^{(g)}. \]
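A sketch of the per-group weight statistics (grouping along the first weight dimension is our assumption; \(\gamma_g\) and \(\eta_g\) are computed analogously from the activation groups):

```python
import torch

def group_params(W: torch.Tensor, G: int):
    """Estimate alpha_g and beta_g per group for Group Quantization."""
    n, m = W.shape
    groups = W.reshape(G, n // G, m)          # assumes n is divisible by G
    alpha = groups.mean(dim=(1, 2))           # alpha_g = (G / nm) * sum over group entries
    beta = groups.abs().mean(dim=(1, 2))      # beta_g  = (G / nm) * ||W^(g)||_1
    return alpha, beta
```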
Vanilla Transformers
\[ E_{add} = m \times (n-1) \times p \times \hat{E}_{add},\\ E_{mul} = m \times n \times p \times \hat{E}_{mul}. \]
BitNet
\[ E_{add} = m \times (n-1) \times p \times \hat{E}_{add},\\ E_{mul} = (m \times p + m \times n) \hat{E}_{mul}. \]
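A back-of-the-envelope comparison of the two formulas; the per-operation energies below are placeholders for illustration, not the values from the paper's table:

```python
# Hypothetical per-operation energies (placeholders): assume a multiply costs ~4x an add.
E_add_hat, E_mul_hat = 1.0, 4.0

def energy_vanilla(m, n, p):
    return m * (n - 1) * p * E_add_hat + m * n * p * E_mul_hat

def energy_bitnet(m, n, p):
    # Additions dominate; only m*p + m*n scalar multiplications (the scaling factors) remain.
    return m * (n - 1) * p * E_add_hat + (m * p + m * n) * E_mul_hat

m = n = p = 4096
print(energy_bitnet(m, n, p) / energy_vanilla(m, n, p))   # well below 1 for large matrices
```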
Scaling curves of BitNet and FP16 Transformers.
Zero-shot (Left) and few-shot (Right) performance of BitNet and FP16 Transformer against the inference cost.
BitNet is more stable than the FP16 Transformer at the same learning rate (Left). This training stability allows BitNet to use a larger learning rate, resulting in better convergence (Right).
Zero-shot (Left) and few-shot (Right) results for BitNet and the post-training quantization baselines on downstream tasks.
b1.58
BitNet b1.58: every single parameter (or weight) of the LLM is ternary \(\{-1, 0, 1\}\).
Matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.
Advantages vs vanilla BitNet
b1.58: Quantization
The absmean quantization function first scales the weight matrix by its average absolute value and then rounds each value to the nearest integer among \(\{-1, 0, +1\}\): \[ \tilde{W} = RoundClip\left(\frac{W}{\gamma + \epsilon}, -1, 1\right),\\ RoundClip(x, a, b) = \max(a, \min(b, round(x))), \\ \gamma = \frac{1}{nm} \sum\limits_{ij} |W_{ij}|. \]
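A direct transcription of the absmean function into PyTorch (a sketch; the \(\epsilon\) value is illustrative):

```python
import torch

def absmean_quantize(W: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """BitNet b1.58 weight quantization: scale by the mean absolute value,
    then round-and-clip every entry to {-1, 0, +1}."""
    gamma = W.abs().mean()
    return torch.clamp(torch.round(W / (gamma + eps)), -1, 1)
```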
Perplexity as well as the cost of BitNet b1.58 and LLaMA LLM.
Decoding latency (Left) and memory consumption (Right) of BitNet b1.58 varying the model size.
Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes. On the left are the components of the arithmetic-operation energy. On the right is the end-to-end energy cost across different model sizes.
Conclusion
Complex CNNs
Quaternion CNNs
Definition
Quaternion algebra \(\mathbb{H}\) is the 4-dimensional vector space over \(\mathbb{R}\), generated by the basis \(\{1, \hat{i}, \hat{j}, \hat{k}\}\), and endowed with the following multiplication rules (Hamilton product): \[ (1)(1) = 1,\\ (1)(\hat{i}) = \hat{j} \hat{k} = -\hat{k} \hat{j} = \hat{i},\\ (1)(\hat{j}) = \hat{k} \hat{i} = -\hat{i} \hat{k} = \hat{j},\\ (1)(\hat{k}) = \hat{i} \hat{j} = -\hat{j} \hat{i} = \hat{k},\\ (\hat{i})^2 = (\hat{j})^2 = (\hat{k})^2 = -1. \]
Multiplication
For two arbitrary quaternions \(\boldsymbol{p} = p_R + p_I \hat{i} + p_J \hat{j} + p_K \hat{k}\) and \(\boldsymbol{q} = q_R + q_I \hat{i} + q_J \hat{j} + q_K \hat{k}\) the multiplication is calculated as follows: \[ \boldsymbol{p} \boldsymbol{q} = p_R q_R - p_I q_I - p_J q_J- p_K q_K + \\ + (p_R q_I + p_I q_R + p_J q_K - p_K q_J) \hat{i} \\ + (p_R q_J - p_I q_K + p_J q_R + p_K q_I) \hat{j} \\ + (p_R q_K + p_I q_J - p_J q_I + p_K q_R) \hat{k}. \]
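This product is what a quaternion layer implements under the hood. A sketch for quaternions stored as length-4 tensors in \((R, I, J, K)\) order:

```python
import torch

def hamilton_product(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Hamilton product of quaternions stored as (..., 4) tensors in (R, I, J, K) order."""
    pr, pi, pj, pk = p.unbind(-1)
    qr, qi, qj, qk = q.unbind(-1)
    return torch.stack([
        pr * qr - pi * qi - pj * qj - pk * qk,   # real part
        pr * qi + pi * qr + pj * qk - pk * qj,   # i component
        pr * qj - pi * qk + pj * qr + pk * qi,   # j component
        pr * qk + pi * qj - pj * qi + pk * qr,   # k component
    ], dim=-1)
```

Because a single quaternion weight acts on all four components of its input through this product, the same weight is reused four times, which underlies the parameter reduction discussed below.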
Convolutions
\[ (\boldsymbol{w} \ast \boldsymbol{q})(x,y) = \sum\limits_{r=-\frac{L}{2}}^{\frac{L}{2}} \sum\limits_{s=-\frac{L}{2}}^{\frac{L}{2}}\left[\boldsymbol{w}(r,s)\boldsymbol{q}(x-r,y-s)\right], \text{ (left-side)}\\ (\boldsymbol{q} \ast \boldsymbol{w})(x,y) = \sum\limits_{r=-\frac{L}{2}}^{\frac{L}{2}} \sum\limits_{s=-\frac{L}{2}}^{\frac{L}{2}}\left[\boldsymbol{q}(x-r,y-s)\boldsymbol{w}(r,s)\right], \text{ (right-side)}\\ (\boldsymbol{w_{left}} \ast \boldsymbol{q} \ast \boldsymbol{w_{right}})(x,y) = \\ = \sum\limits_{r=-\frac{L}{2}}^{\frac{L}{2}} \sum\limits_{s=-\frac{L}{2}}^{\frac{L}{2}}\left[\boldsymbol{w_{left}}(r,s)\boldsymbol{q}(x-r,y-s)\boldsymbol{w_{right}}(r,s)\right], \text{ (two-sided)} \]
Chain rule
If \(L\) is a real-valued loss function, and \(\boldsymbol{q} = q_R + q_I \hat{i} + q_J \hat{j} + q_K \hat{k}\), then \[ \frac{\partial L}{\partial \boldsymbol{q}} = \frac{\partial L}{\partial q_R} + \frac{\partial L}{\partial q_I} \hat{i} + \frac{\partial L}{\partial q_J} \hat{j} + \frac{\partial L}{\partial q_K} \hat{k}. \] For \(\boldsymbol{g} = g_R + g_I \hat{i} + g_J \hat{j} + g_K \hat{k}\) we have: \[ \frac{\partial L}{\partial \boldsymbol{g}} = \frac{\partial L}{\partial q_R}\left(\frac{\partial q_R}{\partial g_R} + \frac{\partial q_R}{\partial g_I} \hat{i} + \frac{\partial q_R}{\partial g_J}\hat{j} + \frac{\partial q_R}{\partial g_K}\hat{k} \right) + \\ + \frac{\partial L}{\partial q_I}\left(\frac{\partial q_I}{\partial g_R} + \frac{\partial q_I}{\partial g_I} \hat{i} + \frac{\partial q_I}{\partial g_J}\hat{j} + \frac{\partial q_I}{\partial g_K}\hat{k} \right) + \\ + \frac{\partial L}{\partial q_J}\left(\frac{\partial q_J}{\partial g_R} + \frac{\partial q_J}{\partial g_I} \hat{i} + \frac{\partial q_J}{\partial g_J}\hat{j} + \frac{\partial q_J}{\partial g_K}\hat{k} \right) + \\ + \frac{\partial L}{\partial q_K}\left(\frac{\partial q_K}{\partial g_R} + \frac{\partial q_K}{\partial g_I} \hat{i} + \frac{\partial q_K}{\partial g_J}\hat{j} + \frac{\partial q_K}{\partial g_K}\hat{k} \right) \]
Reduction in number of parameters
\(y = Wx, \; y \in \mathbb{R}^{4K},\; x \in \mathbb{R}^{4L}, \; W \in \mathbb{R}^{4K \times 4L}\),
where \(W\) contains \(4 \times 4 \times K \times L\) parameters,
is mapped to
\(\boldsymbol{y} = \boldsymbol{W} \boldsymbol{x}, \; \boldsymbol{y} \in \mathbb{H}^{K},\; \boldsymbol{x} \in \mathbb{H}^{L}, \; \boldsymbol{W} \in \mathbb{H}^{K \times L}\),
where \(\boldsymbol{W}\) contains \(4 \times K \times L\) parameters.
Description
Dynamic networks can adjust their computational path based on the input for better efficiency, making it possible to train models with trillions of parameters and accelerate models in a low-resource setting.
The three types of DNNs dynamically adjust computation timewise, widthwise and depthwise, respectively.
Skimming
Skimming was well researched in the era of recurrent neural networks (RNNs). Skimming models save computation timewise by dynamically allocating computation to different time steps, based on the input tokens.
Mixture-of-Experts
MoE horizontally extends a feedforward neural network (FFNN) with multiple sub-networks. During inference, only one or a few of these sub-networks are activated for computation, thus saving widthwise computation.
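A minimal sketch of such a layer with a learned top-\(k\) gate (the expert count, gating scheme, and the dense loop over experts are all simplifications for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal MoE layer: a gating network picks the top-k experts per token,
    and only those FFN sub-networks contribute to the output."""
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        topk, idx = scores.topk(self.k, dim=-1)            # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += topk[mask, slot, None] * expert(x[mask])
        return out
```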
Early exit
Early exit terminates inference at an early layer, without exhausting the full computational capacity, thus saving depthwise computation.
Definition
Neuroevolution: processes of automated configuration and training of neural networks using evolutionary algorithms.
Evolutionary Algorithms, also known as Evolutionary Computation systems, are nature-inspired stochastic techniques that mimic basic principles of life.
Vanilla DL algorithms
Region Adjacency Graph.
Use-cases
Types
Data examples
Reed-Kellogg sentence diagram.
Diagram of an alcohol molecule (left), graph representation with indices (middle), and adjacency matrix (right).
Graph Learning Tasks.
Challenges
Transition function
A graph is defined as a set of vertices and a set of edges, \(G = G(V,E)\).
The transition function \(f\) computes the \(k\)-th embedding \(h_i^k\) at each vertex \(v_i\): in other words, it calculates the next representation of a neighborhood from the current representation. \[ h_i^k = \sum\limits_{j \in N_{v_i}} f\left(v_i^F, e_{ij}^F, v_j^F, h_j^{k-1}\right) \] where \(v_i\) is the central vertex,
\(N_{v_i}\) is the set of vertex indices for the direct neighbors of \(v_i\),
and \(v_i^F\) and \(e_{ij}^F\) are the feature vectors of vertex \(v_i\) and edge \(e_{ij}\).
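A literal (and deliberately naive) sketch of this update, with \(f\) realised as a small MLP — our assumption, not a prescribed choice:

```python
import torch
import torch.nn as nn

class TransitionAggregator(nn.Module):
    """New embedding of v_i = sum over neighbours j of f(v_i_feat, e_ij_feat, v_j_feat, h_j)."""
    def __init__(self, d_v, d_e, d_h):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * d_v + d_e + d_h, d_h), nn.Tanh())

    def forward(self, v_feat, e_feat, h, neighbours):
        # v_feat: (N, d_v), e_feat: dict (i, j) -> (d_e,), h: (N, d_h),
        # neighbours: dict i -> list of neighbour indices j
        h_new = torch.zeros_like(h)
        for i, nbrs in neighbours.items():
            msgs = [self.f(torch.cat([v_feat[i], e_feat[(i, j)], v_feat[j], h[j]]))
                    for j in nbrs]
            if msgs:
                h_new[i] = torch.stack(msgs).sum(0)
        return h_new
```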
Convergence
\(h_i^0\) can be defined arbitrarily on initialisation.
Banach’s fixed-point theorem then guarantees that the subsequently calculated embeddings converge exponentially fast to a unique fixed point, provided \(f\) is implemented as a contraction mapping.
An RGNN forward pass for graph \(G(V,E)\) with \(|V|=4, |E|=4\).
Graph \(G\) goes through \(k\) layers of processing.
Single node aggregates messages from adjacent nodes.
2D Convolution vs. Graph Convolution.

Graph Convolutional Network
GCNs produce embeddings by summing features extracted from each neighboring vertex and then applying non-linearity.
Graph convolutional operation: \[ h_i^{k} = \sigma\left(\sum\limits_{j \in N_{v_i}} \frac{\boldsymbol{W} h_j^{k-1}}{\sqrt{|N_{v_i}||N_{v_j}|}}\right). \]
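A dense-matrix sketch of this layer; adding self-loops via \(A + I\) is a common convention and an assumption here:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """H' = sigma( D^{-1/2} (A + I) D^{-1/2} H W ), which reproduces the
    1/sqrt(|N_vi||N_vj|) normalisation for an unweighted graph."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)

    def forward(self, A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        A_hat = A + torch.eye(A.size(0))         # add self-loops
        deg = A_hat.sum(dim=1)                   # neighbourhood sizes incl. the self-loop
        D_inv_sqrt = torch.diag(deg.rsqrt())
        return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.W(H))
```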
A ConvGNN with multiple graph convolutional layers. A graph convolutional layer encapsulates each node’s hidden representation by aggregating feature information from its neighbors. After feature aggregation, a non-linear transformation is applied to the resulting outputs. By stacking multiple layers, the final hidden representation of each node receives messages from a further neighborhood.
A ConvGNN with pooling and readout layers for graph classification. A graph convolutional layer is followed by a pooling layer to coarsen a graph into sub-graphs so that node representations on coarsened graphs represent higher graph-level representations. A readout layer summarizes the final graph representation by taking the sum/mean of hidden representations of sub-graphs.
Graph Attention Network
Graph Attention Networks (GATs) extend GCNs: instead of using the size of the neighborhoods to weight the importance of \(v_i\) to \(v_j\), they implicitly calculate the weighting from the normalised output of an attention mechanism. \[ h_i^{k} = \sigma\left(\sum\limits_{j \in N_{v_i}} \alpha_{ij} \boldsymbol{W} h_j^{k-1}\right),\\ \alpha_{ij} = \frac{\exp\left(att\left(h_i^{k-1}, h_j^{k-1}, e_{ij}\right)\right)}{\sum\limits_{l \in N_{v_i}} \exp\left(att\left(h_i^{k-1}, h_l^{k-1}, e_{il}\right)\right)}. \]
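A single-head, dense sketch of this update (edge features omitted; the concatenation-based attention function is one common choice, not the only one):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT sketch: attention logits from pairs of transformed node features,
    softmax-normalised over each neighbourhood."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.att = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Wh = self.W(H)                                        # (N, d_out)
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        logits = F.leaky_relu(self.att(pairs).squeeze(-1))    # att(h_i, h_j) -> (N, N)
        mask = (A + torch.eye(N)) > 0                         # neighbours plus a self-loop
        logits = logits.masked_fill(~mask, float('-inf'))
        alpha = torch.softmax(logits, dim=-1)                 # alpha_ij, rows sum to 1
        return torch.relu(alpha @ Wh)
```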
Description
GAEs represent the application of GNNs (often ConvGNNs) to autoencoding. The goal of an AE can be summarised as follows: to project the input features into a new space (known as the latent space) where the projection has more desirable properties than the input representation.
These properties may include:
The architecture of a standard (non-graph) AE.
The architecture for a GAE. The input graph is described by the adjacency matrix \(A\) and the vertex feature matrix \(X\).
Description
Rather than representing inputs with single points in latent space, variational autoencoders (VAEs) learn to encode inputs as probability distributions in latent space.
An example of a VGAE. Graph inputs are encoded via a GNN into multivariate Gaussian parameters (i.e. mean and variance).
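A minimal encoder sketch showing the mean/log-variance heads and the reparameterized sample (layer sizes and the simple \(A + I\) aggregation are illustrative):

```python
import torch
import torch.nn as nn

class VGAEEncoderSketch(nn.Module):
    """One shared graph layer, then two heads producing the mean and log-variance
    of the latent Gaussian, from which z is sampled via the reparameterization trick."""
    def __init__(self, d_in, d_hidden, d_latent):
        super().__init__()
        self.shared = nn.Linear(d_in, d_hidden)
        self.mu_head = nn.Linear(d_hidden, d_latent)
        self.logvar_head = nn.Linear(d_hidden, d_latent)

    def forward(self, A, X):
        h = torch.relu((A + torch.eye(A.size(0))) @ self.shared(X))   # one-hop aggregation
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # z ~ N(mu, sigma^2)
        return z, mu, logvar
```

The decoder then typically reconstructs the adjacency matrix, e.g. as \(\sigma(z z^{T})\), and training adds a KL term pulling the latent distributions toward a standard normal.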
Description
Graph Adversarial Techniques (GAdvTs) use adversarial learning methods, whereby one AI model acts as an adversary to another during training so that both models improve in tandem.
A typical approach to adversarial training with VGAEs.
Equations depending on type.
Neats vs Scruffies (Norvig)
Neats: AI theories should be grounded in mathematical rigor.
Scruffies: try out lots of ideas, write some programs, and then assess what seems to be working.
Human-environment interaction
Perception – transforming sensory inputs from their environment into symbols. Roughly maps to neural networks.
Cognition – mapping symbols to knowledge about the environment for supporting abstraction, reasoning by analogy, and long-term planning. Roughly maps to symbolic AI.
These roughly correspond to Daniel Kahneman’s System 1 and System 2.
Why add symbolic AI?
Deep Learning Criticisms
Two types:
Two methods for compressing knowledge graphs.
Challenges
Implementations