Precision, as a classification metric, measures the proportion of positive predictions that are actually correct, which is why it helps to review the confusion matrix before working with precision and recall in machine learning. In the context of training and running models, however, precision refers to the numerical format used to store weights and activations; the most common formats are listed below.
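As a quick illustration of the classification meaning, here is a minimal sketch assuming scikit-learn is available; the labels are made up, and it computes precision both directly from the confusion matrix and via `precision_score`.

```python
# Classification precision: TP / (TP + FP), the share of predicted positives
# that are truly positive. Labels below are made up for illustration.
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                    # 0.75, computed from the confusion matrix
print(precision_score(y_true, y_pred))   # same value via scikit-learn
```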
- float16: Also known as half precision, this is a binary floating-point format that occupies 16 bits of memory, with 1 sign bit, a 5-bit exponent, and a 10-bit mantissa. It is intended for storing floating-point values in applications where full 32-bit precision is not essential, such as image processing and neural networks.
- bfloat16: A 16-bit floating-point format used primarily in deep learning for its performance benefits. It keeps the same 8-bit exponent as a 32-bit float but has only a 7-bit mantissa, so it covers roughly the same range as float32 with less precision (see the dtype comparison sketch after this list).
- 8bit: In the context of machine learning, 8-bit precision often refers to the practice of quantizing model weights and activations to 8-bit integers (int8) during inference. This reduces both memory and computing requirements, enabling larger models to fit into memory and speeding up inference.
- 4bit: 4-bit precision is used in extreme quantization scenarios where model weights are quantized to 4 bits. This significantly reduces the memory footprint of the model, enabling it to run on devices with limited memory (a loading sketch for 8-bit and 4-bit models follows this list).
- GPTQ: GPTQ is a one-shot weight quantization method based on approximate second-order information. It can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bit-width down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline (a usage sketch follows below).
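To make the two 16-bit formats above concrete, here is a minimal sketch, assuming only that PyTorch is installed, that inspects each dtype's range and precision and casts a tensor to both formats.

```python
import torch

# Compare range (max) and precision (eps) of the common training/inference dtypes.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")

# Casting halves the memory per element (4 bytes -> 2 bytes) relative to float32.
x = torch.randn(4, 4)                 # float32 by default
x_fp16 = x.to(torch.float16)          # smaller range, ~3 decimal digits of precision
x_bf16 = x.to(torch.bfloat16)         # float32-like range, fewer mantissa bits
print(x_fp16.element_size(), x_bf16.element_size())  # 2 2
```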
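The 8-bit and 4-bit entries above are commonly used through the transformers + bitsandbytes integration. The sketch below assumes those libraries are installed, uses "facebook/opt-1.3b" purely as an example checkpoint, and may need adjusting across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example checkpoint, not a recommendation

# 8-bit: weights are quantized to int8 at load time; compute still happens in a
# higher-precision dtype, so quality usually stays close to the fp16 baseline.
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

# 4-bit: roughly half the footprint of 8-bit; matmuls run in the chosen compute dtype.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

print(model_8bit.get_memory_footprint(), model_4bit.get_memory_footprint())
```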
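For GPTQ, one common path is the transformers/optimum integration backed by the auto-gptq package. The sketch below is an assumed workflow (example checkpoint, "c4" calibration-data shortcut), not the exact recipe from the GPTQ paper; calibration samples are needed because the method relies on approximate second-order information gathered from sample activations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-1.3b"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize to 4 bits per weight, calibrating on samples from the "c4" dataset.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)

quantized.save_pretrained("opt-1.3b-gptq-4bit")  # quantized weights take far less disk space
```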