LLM Data Processing: Tokens, Embeddings, Temperature, Hallucination


Data Processing and Output Generation

These are the terms you will encounter most often while interacting with a model.

Tokens

Models read text not word by word but in small chunks called "tokens". As a rule of thumb, 1,000 tokens correspond to roughly 750 English words. Tokens are the unit of account for LLM compute: context limits and API pricing are both measured in them.
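The rule of thumb above can be turned into a quick back-of-the-envelope estimator. This is only a sketch of the heuristic stated in the text; real tokenizers (BPE, SentencePiece) split on learned subword units, so actual counts will differ.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~1000 tokens per ~750 words (4/3 tokens per word)."""
    words = len(text.split())
    return round(words * 1000 / 750)

sample = "Models read text not word by word but in small chunks called tokens."
print(estimate_tokens(sample))
```

This only approximates English prose; code, non-Latin scripts, and rare words typically tokenize into more pieces per word.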

Embeddings

The conversion of words or sentences into numerical vectors that a computer can understand. Semantically similar words are positioned close to each other in this vector space.

  • Application: Embeddings allow for "Semantic Search". Instead of keyword matching, you can find records based on meaning. This could be applied to search through historical alarm logs in a system like ZMA.
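A minimal sketch of how semantic search over such vectors works. The 3-dimensional embeddings and alarm-log texts below are hypothetical placeholders (real models produce vectors with hundreds or thousands of dimensions); ranking uses cosine similarity, the standard choice for embedding search.

```python
import math

# Hypothetical toy embeddings for a few alarm-log entries.
EMBEDDINGS = {
    "pump overheating":       [0.9, 0.1, 0.0],
    "motor temperature high": [0.8, 0.2, 0.1],
    "door left open":         [0.0, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_search(query_vec, top_k=1):
    """Return the top_k stored texts whose vectors are closest to the query."""
    ranked = sorted(EMBEDDINGS.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# A query vector near the "overheating" region of the space matches
# thermally related entries even without shared keywords.
print(semantic_search([0.85, 0.15, 0.05], top_k=2))
```

In a real system the query vector would come from embedding the user's search phrase with the same model that embedded the logs.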

Temperature

A setting that controls the creativity or randomness of the model's output.

  • Low (e.g., 0.1): More consistent and nearly deterministic. Better for code or technical data.
  • High (e.g., 0.8+): More varied and creative. Better for brainstorming.
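Under the hood, temperature divides the model's raw next-token scores (logits) before the softmax that turns them into probabilities. The logits below are made-up illustration values; the mechanism is the standard one.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

low = softmax_with_temperature(logits, 0.1)   # near-deterministic
high = softmax_with_temperature(logits, 1.5)  # spread across candidates
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At T = 0.1 almost all probability mass lands on the top-scoring token, which is why low temperatures give consistent output; at higher T the runner-up tokens get sampled noticeably often.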

Hallucination

A failure mode in which the model confidently states fabricated information as if it were fact.

  • Warning: In industrial settings, minimizing hallucination is critical. When interpreting sensor data from a GDT Digital Transmitter, an AI system must be strictly grounded (often using RAG) to avoid reporting false faults.
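The grounding idea can be sketched as a simple retrieve-then-answer gate: the system only answers from records it actually retrieved, and refuses otherwise. The record store, key names, and sensor reading below are hypothetical placeholders, not a real ZMA/GDT integration.

```python
# Hypothetical store of verified sensor records (stand-in for a real retriever).
KNOWLEDGE_BASE = {
    "sensor_4012": "Transmitter 4012 read 12.1 mA at 10:02 (within 4-20 mA range).",
}

def grounded_answer(query_key: str) -> str:
    """Answer only from retrieved context; refuse when nothing is found."""
    context = KNOWLEDGE_BASE.get(query_key)
    if context is None:
        # Refusing is the RAG-style safeguard: better no answer than an
        # improvised, plausible-sounding fault report.
        return "No grounded data available for this query."
    return f"Based on retrieved record: {context}"

print(grounded_answer("sensor_4012"))
print(grounded_answer("sensor_9999"))
```

A production pipeline would retrieve context by embedding similarity rather than exact keys, but the safeguard is the same: the model's answer must be traceable to a retrieved record.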