
Dev tutorial builds an LLM from scratch by turning Shakespeare into vectors—because nothing says innovation like coding your own tokenizer hell
The second installment of the "Fazendo um LLM do Zero" ("Building an LLM from Scratch") series tackles the challenge of teaching a machine to read and understand human language. The fundamental problem: computers operate on numbers, zeros and ones, while human language is built from complex linguistic structures. Embeddings bridge this gap by transforming text into dense numerical representations that let the model capture semantic meaning.

The process starts with tokenization, which breaks text into discrete tokens, and Byte Pair Encoding (BPE), which handles out-of-vocabulary words by splitting them into known subwords. The full text-processing pipeline then runs through several steps: tokenization, ID assignment, and embedding creation, with the resulting vectors feeding into the neural network. Trained on this representation, the model learns to predict future text from past context, paving the way for advanced language understanding capabilities.
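The tutorial's own BPE code isn't reproduced here, but the subword idea can be illustrated with a minimal sketch. This is a greedy longest-match split against a hypothetical subword vocabulary (real BPE learns its merges from corpus statistics, and the vocabulary below is invented for the example); it shows how an out-of-vocabulary word still decomposes into known pieces instead of becoming an unknown token.

```python
def split_into_subwords(word, vocab):
    """Greedily split `word` into the longest pieces present in `vocab`,
    falling back to single characters so no word is ever 'unknown'."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, fall back to one char.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

# A tiny made-up subword vocabulary for illustration only.
vocab = {"token", "iza", "tion", "un", "believ", "able"}

print(split_into_subwords("tokenization", vocab))  # ['token', 'iza', 'tion']
print(split_into_subwords("unbelievable", vocab))  # ['un', 'believ', 'able']
```

A word with no matching subwords at all degrades gracefully to individual characters, which is exactly the property that lets BPE-based models handle any input string.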
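The pipeline steps named above (tokenization, ID assignment, embedding creation) can be sketched end to end in a few lines. This is a toy version under stated assumptions: whitespace tokenization stands in for a real tokenizer, and the embedding vectors are random rather than learned, since in an actual model they are trained parameters. All function names here are invented for the sketch.

```python
import random

def build_vocab(text):
    # Step 1, tokenization: naive whitespace split for this sketch.
    tokens = text.lower().split()
    # Step 2, ID assignment: each unique token gets an integer id.
    token_to_id = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    return tokens, token_to_id

def embed(tokens, token_to_id, dim=4, seed=0):
    # Step 3, embedding creation: each id maps to a dense vector.
    # Real models learn these vectors; here they are just random.
    rng = random.Random(seed)
    table = {i: [rng.uniform(-1, 1) for _ in range(dim)]
             for i in token_to_id.values()}
    ids = [token_to_id[t] for t in tokens]
    return ids, [table[i] for i in ids]

tokens, token_to_id = build_vocab("to be or not to be")
ids, vectors = embed(tokens, token_to_id)
print(ids)  # repeated tokens share an id, hence the same vector
```

Note that the two occurrences of "to" (and of "be") map to identical vectors: the embedding table is a lookup keyed by token ID, which is what makes the representation dense and reusable.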