Understand

What are LLMs and How Do They Learn?

Overview

Large Language Models (LLMs) are a type of GenAI that learn patterns from vast amounts of text data. They are “large” in both the computing power required to train them and the amount of training data they use.

The main job of an LLM is to predict which words are likely to appear next in a given phrase or sentence. At a basic level, this involves recognizing grammatical correctness, like preferring “the dog slept” over “the dog green.” However, modern LLMs consider larger contexts, helping them produce more accurate and nuanced responses.
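As a toy illustration of next-word prediction, the sketch below (in Python) assigns made-up probabilities to a few candidate continuations of “the dog” and picks the most likely one. The candidate words and their probabilities are invented for this example; a real LLM computes a distribution over tens of thousands of tokens using billions of learned parameters.

```python
# Toy sketch of next-word prediction (illustrative only: the words and
# probabilities below are made up, not produced by a real model).

# Hand-made probability distribution over possible next words after "the dog".
next_word_probs = {
    "slept": 0.46,    # grammatical and common, so high probability
    "barked": 0.41,   # also plausible
    "quickly": 0.10,
    "the": 0.02,
    "green": 0.01,    # grammatically odd, so very low probability
}

prompt = "the dog"

# The model's basic job: pick (or sample) a likely continuation.
best_word = max(next_word_probs, key=next_word_probs.get)
print(f"{prompt} {best_word}")   # prints: the dog slept
```

A real model repeats this single step over and over, each time conditioning on everything generated so far, which is how a word-by-word predictor ends up producing whole paragraphs.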


What else should I know?

Training data comes from:

  • Internet: Public websites. These are diverse but may contain errors, biases, or harmful information.
  • Books and literature: Public domain and copyrighted texts. They offer more refined language but raise concerns over unauthorized use.
  • Wikipedia and knowledge bases: Structured and updated information, useful for factual accuracy.
  • Programming code: Extracted from platforms like GitHub, helpful for coding tasks.
  • Images: Used in multimodal models, though they may have copyright implications.

Despite generating coherent responses, LLMs do not actually plan or have intentions. They predict each word based solely on probability, which means:

  • The same prompt can lead to different responses (illustrated in the sketch after this list).
  • Long responses might lose earlier details, as the model can only consider a limited context.
  • They can’t perform tasks requiring strategic planning, like playing chess effectively, because there’s no forward-thinking involved.
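A minimal sketch of the first two points, assuming a made-up vocabulary, made-up probabilities, and a tiny hypothetical context limit: instead of always taking the most likely word, generation typically samples from the probability distribution, so repeated runs with the same prompt diverge, and only the most recent words fit into the model’s context window.

```python
import random

# Made-up next-word distribution for the prompt "the dog" (illustrative only).
candidates = ["slept", "barked", "ran", "sat"]
weights = [0.4, 0.3, 0.2, 0.1]

prompt = "the dog"

# Generation usually samples from the distribution rather than always taking
# the top word, so repeated runs with the same prompt can differ.
for run in range(3):
    next_word = random.choices(candidates, weights=weights, k=1)[0]
    print(f"run {run + 1}: {prompt} {next_word}")

# Limited context: only the most recent words fit in the model's window,
# so earlier details can drop out of long conversations.
CONTEXT_WINDOW = 5  # hypothetical limit; real models handle thousands of tokens
conversation = "my name is Ana and I like hiking in the mountains".split()
visible = conversation[-CONTEXT_WINDOW:]
print("model sees only:", " ".join(visible))  # the name "Ana" is no longer visible
```

In practice, how much the output varies is controlled by sampling settings such as temperature, which trade variety against predictability.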

Models learn from real-world data, which means they can reproduce stereotypes or prejudices. For example, if a group is negatively represented in the data, the model may repeat that bias.

To mitigate this, developers:

  • Curate data more carefully.
  • Use bias detection tools (a rough sketch of the idea follows this list).
  • Apply techniques to reduce biased outputs.
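As a rough idea of what a bias check can look like, the sketch below counts how often two hypothetical group terms appear alongside negative words in a tiny sample corpus; a large imbalance would be a warning sign worth reviewing. The word lists, group labels, and sentences are all invented for illustration, and real bias-detection tooling is far more sophisticated than this.

```python
# Minimal sketch of a co-occurrence bias check (illustrative only).

NEGATIVE_WORDS = {"lazy", "dangerous", "dishonest"}   # invented word list
GROUP_TERMS = {"group_a", "group_b"}                  # hypothetical group labels

sample_corpus = [
    "group_a workers are lazy according to the post",
    "group_a residents described as dangerous",
    "group_b volunteers praised for their honesty",
]

counts = {term: 0 for term in GROUP_TERMS}
for sentence in sample_corpus:
    words = set(sentence.lower().split())
    for term in GROUP_TERMS & words:
        # Count sentences where a group term co-occurs with a negative word.
        counts[term] += bool(NEGATIVE_WORDS & words)

print(counts)  # e.g. {'group_a': 2, 'group_b': 0} -> imbalance worth reviewing
```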

During training, the model adjusts its parameters every time it incorrectly predicts a word. This improves its ability to generate coherent text, although it doesn’t understand meaning like a person would.
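A minimal sketch of that adjustment, assuming a toy “model” that is just a table of scores rather than a real neural network: after each prediction, the scores are nudged so that the word which actually came next becomes more likely. The words, starting scores, and learning rate are invented for illustration.

```python
import math

# Minimal sketch of one training step for next-word prediction
# (the "model" here is just a table of scores, not a real neural network).

scores = {"slept": 0.5, "green": 1.2, "barked": 0.3}  # invented starting scores
target = "slept"        # the word that actually came next in the training text
learning_rate = 0.5

def softmax(score_table):
    """Turn raw scores into a probability distribution."""
    exps = {w: math.exp(s) for w, s in score_table.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

for step in range(3):
    probs = softmax(scores)
    loss = -math.log(probs[target])  # cross-entropy: low when the target is likely
    # The gradient of this loss with respect to each score is
    # (probability - 1) for the target word and (probability) for the rest,
    # so the update raises the target's score and lowers the others.
    for word in scores:
        grad = probs[word] - (1.0 if word == target else 0.0)
        scores[word] -= learning_rate * grad
    print(f"step {step + 1}: loss={loss:.3f}, P(slept)={probs[target]:.2f}")
```

Scaled up to billions of parameters and enormous text corpora, this same kind of update is what produces fluent text, without the model ever attaching meaning to words the way a person does.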

How the training data is obtained also matters:

  • Public data: Easy to access but raises legal and ethical concerns.
  • Licensed data: Reviewed and reliable, and it ensures authors are compensated.

Models can also learn from user interactions, which means what you write may influence future responses. Therefore, it’s important to:

  • Avoid sharing personal or confidential information.
  • Review how each service handles your data.