Precision in Machine Learning

In classification, precision is the metric that measures the proportion of positive predictions that are actually correct; it is computed from the confusion matrix as TP / (TP + FP), alongside recall. In this article, however, "precision" refers to numeric precision: the number of bits used to represent model weights and activations, which trades accuracy against memory and compute.
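
For reference, the classification metric can be computed directly from confusion-matrix counts. A minimal sketch in plain Python (the labels are made up purely for illustration):

```python
# Precision answers: of everything the model predicted positive,
# how much was actually positive?  precision = TP / (TP + FP)
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print(tp / (tp + fp))  # 3 / (3 + 2) = 0.6
```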

  • float16: Also known as half-precision, it is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, such as image processing and neural networks.
  • bfloat16: This is a 16-bit floating-point format used primarily in deep learning for its performance benefits. It has the same 8-bit exponent as a 32-bit float, but only a 7-bit mantissa, so it covers roughly the same dynamic range as 32-bit floats with less precision (see the sketch after this list).
  • 8bit: In the context of machine learning, 8-bit precision often refers to the practice of quantizing model weights and activations to 8-bit integers (int8) during inference. This reduces both memory and computing requirements, enabling larger models to fit into memory and speeding up inference.
  • 4bit: 4-bit precision is used in extreme quantization scenarios where model weights are quantized to 4 bits. This significantly reduces the memory footprint of the model, enabling it to run on devices with limited memory.
  • GPTQ: GPTQ is a one-shot weight quantization method based on approximate second-order information. It can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bit-width down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline.
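
The trade-off between float16 and bfloat16 is easy to see by inspecting their numeric limits and casting a few values. A minimal sketch, assuming PyTorch is installed (the sample values are arbitrary):

```python
import torch

# Numeric limits of the formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  bits={info.bits}  max={info.max:.3e}  eps={info.eps:.3e}")

# Casting shows the trade-off: bfloat16 keeps the range of float32 but loses
# mantissa precision, while float16 keeps more precision but overflows sooner.
x = torch.tensor([1e10, 1.001])
print(x.to(torch.float16))   # 1e10 overflows to inf in float16
print(x.to(torch.bfloat16))  # 1e10 fits in bfloat16, but 1.001 rounds to 1.0
```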

Areas of application

  • float16: Image processing, neural networks, computer graphics, etc.
  • bfloat16: Deep learning, machine learning, cloud computing, etc.
  • 8bit: Model inference, edge computing, mobile devices, etc.
  • 4bit: Extreme quantization, large language models, memory-constrained devices, etc.
  • GPTQ: Quantizing GPT models, natural language generation, text summarization.

Example

  • float16: An example of using float16 is in image processing, where the pixel values are typically stored as 8-bit integers. However, some operations, such as convolution, may require higher precision to avoid rounding errors. In this case, converting the pixel values to float16 can provide enough precision without using too much memory or computation.
  • bfloat16: An example of using bfloat16 is in deep learning, where the model parameters and activations are often stored as 32-bit floats. However, this can be inefficient and costly, especially for large models. By using bfloat16, you can reduce the memory and bandwidth requirements by half, while preserving the dynamic range of 32-bit floats. This can improve the performance and scalability of deep learning applications.
  • 8bit: An example of using 8-bit precision is in model inference, where the model weights and activations are quantized to 8-bit integers. This reduces the memory and computation requirements by a factor of four compared to 32-bit floats, which lets larger models run on devices with limited resources, such as mobile phones or edge devices (a minimal quantization sketch follows this list).
  • 4bit: An example of using 4-bit precision is in extreme quantization, where the model weights are quantized to 4 bits. This reduces the memory footprint of the weights by a factor of eight compared to 32-bit floats, which can let very large models, such as GPT-3-scale language models, fit on hardware whose memory would otherwise be too small, such as a single GPU.
  • GPTQ: An example of using GPTQ is in quantizing GPT models, which are large language models that generate natural-language text. GPTQ quantizes the model weights to 3 or 4 bits per weight, using approximate second-order information to minimize the quantization error. This can shrink the stored weights by a factor of roughly eight to ten compared to 32-bit floats while largely maintaining model quality (see the loading sketch below).
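
To make the 8-bit case concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, assuming PyTorch. Real inference stacks typically quantize per channel and handle activations as well, so this only illustrates the storage trade-off:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: w is approximated by scale * q, with q stored as int8."""
    scale = w.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)                                  # stand-in for fp32 model weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
print("bytes per weight: 4 (float32) -> 1 (int8)")
```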
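
And a sketch of loading a model with 4-bit weights and with GPTQ quantization through the Hugging Face transformers API. This assumes recent versions of transformers plus the bitsandbytes, optimum, and auto-gptq backends are installed; the model name is just a small placeholder, and exact arguments may differ across versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "facebook/opt-125m"  # small placeholder model, purely for illustration

# 4-bit weight-only quantization applied at load time (bitsandbytes NF4 format).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# One-shot GPTQ quantization to 4 bits, calibrated on a small text dataset.
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model_gptq = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
```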