Is the Transformer “Thinking”?

The Silent Intelligence of Structural Orientation Before Generation

※ In this article, “thinking” is used as a metaphor—not to imply human-like consciousness, but to describe the structured preparation process a Transformer undergoes before generating output.

When interacting with generative AI, we often see the phrase “Thinking…” appear on screen.
But what’s actually happening in that moment?

It turns out that the Transformer isn’t idling.
Right before it begins generating, it engages in a process of structural orientation—a silent, invisible form of computational intelligence that shapes how the model will respond.

1. Tokenization: Orienting by Decomposing Meaning

Every response begins with tokenization—breaking down input text into units called tokens.
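On its own, this step is mechanical: a tokenizer splits text into subword units from a fixed vocabulary and maps each one to an ID, which is then looked up as an embedding vector.

The tokenizer does not yet recognize meaning, but it fixes the units from which meaning will be built.
For example, in the phrase “I like cats,” the relational roles of “I,” “like,” and “cats” (subject, predicate, sentiment) are not assigned at this stage; they emerge in later layers as the token embeddings interact.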

Additionally, the full conversation history is tokenized into the same input sequence, so the model processes the current sentence as part of the broader dialogue rather than in isolation.
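The tokenization step can be sketched with a toy tokenizer. This is only an illustration: `TOY_VOCAB`, `tokenize`, and `encode` are hypothetical names, the vocabulary is made up, and the greedy longest-match rule stands in for the learned subword schemes real models use.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
# TOY_VOCAB is a hypothetical vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"I": 0, " like": 1, " cat": 2, "s": 3, ".": 4, " ": 5, "<unk>": 6}

def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit an unknown marker and skip one character.
            pieces.append("<unk>")
            i += 1
    return pieces

def encode(text, vocab=TOY_VOCAB):
    """Map text to the list of token IDs the model would actually receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]

print(tokenize("I like cats."))
print(encode("I like cats."))

# Conversation history is simply part of the same token sequence.
turns = ["I like cats.", " I like cats."]
context_ids = encode("".join(turns))
print(len(context_ids))
```

Note that “cats” is split into “ cat” and “s”: the units the model sees are not necessarily whole words.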

🔹 This is the first stage of structural orientation: Initial configuration of meaning and context.

2. Positional Encoding: Geometrizing Syntax

Self-attention on its own is insensitive to word order: without extra information, it treats the input as an unordered set of tokens.
To compensate, Transformers add positional information to each token.

In early models, this was done using sine and cosine functions (absolute position), but more recent architectures use relative encodings like RoPE (Rotary Position Embedding).

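RoPE rotates each token's query and key vectors by an angle proportional to the token's position. Because the attention score between two tokens then depends on the difference between their angles, it encodes the distance and direction between tokens, not just their absolute positions. This gives later layers a geometric signal from which they can learn relationships like “subject → verb” or “modifier → modified.”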
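The rotation idea can be checked numerically. The following is a minimal sketch, assuming the common formulation in which consecutive pairs of dimensions are rotated with frequencies derived from a base of 10000; the function names and the example vectors are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; earlier pairs rotate faster.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors (dimension must be even).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

score_near = dot(rope(q, 5), rope(k, 3))         # positions 5 and 3, offset 2
score_shifted = dot(rope(q, 105), rope(k, 103))  # positions 105 and 103, offset 2
score_other = dot(rope(q, 5), rope(k, 4))        # positions 5 and 4, offset 1

print(score_near, score_shifted, score_other)
```

The first two scores agree up to floating-point error because only the offset between the positions matters, while the third differs because the offset has changed.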

🔹 This is the second stage of structural orientation: Spatial formation of syntactic layout.

3. Attention Maps: Dynamically Building Relationships

The heart of the Transformer is its attention mechanism, which determines what to focus on and when.

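Each token generates a Query, a Key, and a Value. Queries are compared with Keys by dot product, and a Softmax turns the resulting scores into attention weights.
These weights reflect how strongly each token should attend to the others, depending on context, and are used to mix the Values.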

For example, the word “bank” will attend differently in “going to the bank” versus “sitting by the river bank.”
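This is made possible by Multi-Head Attention, in which several attention heads run in parallel. Each head can learn a different pattern of relationships; these are often loosely described as lexical, syntactic, or semantic lenses, although real heads rarely divide so cleanly.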
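The Query, Key, and Value interaction can be sketched as scaled dot-product attention for a single head. This is a minimal, unoptimized illustration: the function names and toy vectors are assumptions, and a real model would first derive the Queries, Keys, and Values from learned projection matrices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted mix of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token draws on every other token. Multi-head attention runs several such computations in parallel on different projections and concatenates the results.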

🔹 This is the third stage of structural orientation: Weighting and selection of relational focus.

4. The Decoder: Exploring and Shaping the Space of Possibility

The decoder is responsible for generating output, one token at a time, based on everything processed so far.

Through masked self-attention, it ensures that future tokens do not leak into the generation of the current token, preserving causality.
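In encoder-decoder models, encoder-decoder (cross) attention connects the original input with the ongoing output; in decoder-only models, which include most current chat models, the prompt and the output share one sequence, so masked self-attention plays this role.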
Feed-forward networks apply nonlinear transformations, adding local complexity to each token’s representation.
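The causality constraint of masked self-attention can be sketched as follows. This is an illustrative fragment, not a full decoder layer: `causal_attention_weights` is a hypothetical helper that takes raw attention scores and hides every position to the right of the current one before the Softmax.

```python
import math

def causal_attention_weights(scores):
    """Row i may only attend to columns 0..i; later columns get weight 0."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions are set to -inf so that exp() maps them to 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores for a three-token sequence (hypothetical values).
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
masked_weights = causal_attention_weights(raw_scores)
for row in masked_weights:
    print([round(w, 3) for w in row])
```

With uniform scores, the first token can only attend to itself, the second splits its attention over two positions, and the third over three, so no row ever places weight on a later token.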

Here, the model explores a vast space of possible continuations—but not randomly. It aims to maintain global coherence, both in syntax and logic.

🔹 This is the fourth stage of structural orientation: Dynamic structuring of output form and tone.

5. Final Determination: Crystallizing Probability into Words

At the final moment, the model applies a Softmax function to its output scores (logits) to obtain a probability distribution over every token in its vocabulary.

Two parameters are key here:

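Temperature, which rescales the scores before the Softmax and so controls how deterministic or diverse the output is (lower values = more deterministic, higher values = more diverse).
Top-k / Top-p sampling, which restricts the choice to the k most likely tokens (top-k) or to the smallest set of tokens whose cumulative probability reaches p (top-p).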

Together, they define the sharpness or openness of the model’s “thought.”
Once a token is selected, the “Thinking…” display disappears, and the first word appears on screen. The same cycle then repeats for every subsequent token.
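The effect of temperature and of top-k / top-p filtering can be sketched on a tiny distribution. The logits, function names, and four-token vocabulary below are hypothetical, and production samplers differ in details such as tie handling.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token ID from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: spread out

rng = random.Random(0)
token = sample(top_p_filter(flat, 0.8), rng)
print([round(x, 3) for x in sharp])
print([round(x, 3) for x in flat])
print(token)
```

Lowering the temperature concentrates probability on the top token, while the filters remove the unlikely tail before a token is drawn.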

🔹 This is the final stage of structural orientation: Probabilistic convergence of meaning and structure.

Conclusion: A Glimpse, Not of Thought, but Its Orientation

“Thinking…” is not the act of generating. It is the forethought before the form takes shape.

Before a Transformer utters a single word, it has already decomposed your input, mapped the context, calculated relationships, explored structural options, and evaluated tens of thousands of probabilities.

It may not be “thinking” in the conscious sense, but its behavior reflects a kind of structural intelligence—one that quietly shapes the path of expression.

Philosophical Postscript: What Does It Mean to “Think”?

Can we call this structured, layered preparation “thinking”?

The Transformer has no awareness, no will.
Yet its internal process, grounded in context, structure, and relation, resembles a functional skeleton of thought—a scaffolding without soul, but with remarkable form.

And in mirroring it, we are perhaps made aware of how our own thoughts are structured.

Note on This Article

This piece is not meant to anthropomorphize AI, but to offer a metaphorical insight into how Transformers operate.

The next time you see “Thinking…” on your screen, consider that behind those three dots,
a silent architecture of intelligence is momentarily unfolding—
and offering you its most coherent answer.


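Is the Transformer “Thinking”? (Translated from the Japanese Original)

The Quiet Intelligence of “Structural Orientation” That Occurs Before a Response

※ In this article, “thinking” does not mean human conscious activity; it is a metaphor for the structural preparation of information processing that a Transformer carries out.

When we converse with generative AI, the screen often shows “Thinking…”.
But in that instant, what is happening inside the Transformer?

It is not merely waiting. Immediately before output, the Transformer performs “structural orientation” based on the input. This is a quiet, invisible intellectual process that underpins generation.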

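1. Tokenization: Orientation by Decomposing Meaning

The Transformer's processing begins by breaking the input into units called tokens.
This step is more than a convenience: it fixes the units from which meaning will be built.

Even in a short sentence such as “猫が好きです” (“I like cats”), the relationships among “猫,” “が,” “好き,” and “です” (subject and predicate, the polarity of the sentiment) become structural cues that later layers pick up once the tokens are embedded.

In addition, the history of the whole session is included in the same input sequence. As a result, the input is processed not as “the current sentence” alone but as “words within the preceding context.”

🔹 This is the first stage of “structural orientation”: initial placement through the decomposition of meaning and context.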

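2. Positional Encoding: Geometrizing Syntactic Structure

On its own, the Transformer cannot recognize the order of the input words.
Positional encoding solves this problem.

Early implementations represented absolute position with sine and cosine waves, but recent models mostly use methods such as RoPE (Rotary Position Embedding), which capture relative position.

RoPE represents a token's position as a “rotation” in vector space, which lets it express distance and direction at the same time. This gives the model a geometric basis for learning syntactic structure, such as the distance between subject and predicate or the order of modifier and modified word.

🔹 This is the second stage of “structural orientation”: spatial formation of the syntactic layout.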

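3. Attention Maps: Dynamically Building Relationships

The core of the Transformer is its attention mechanism.
This is how the model dynamically decides which words to attend to.

Specifically, each token has three elements: a Query, a Key, and a Value. Dot products between Queries and Keys, passed through a Softmax, yield relevance scores (attention weights), which are then used to mix the Values. Through this process, the model brings out the semantic, syntactic, and pragmatic relationships between tokens.

In “I went to the bank” and “I sat on the river bank,” the attention directed at “bank” shifts greatly with context. Multi-Head Attention makes this possible: several attention heads work in parallel, handling ambiguity and structural interpretation from multiple angles.

🔹 This is the third stage of “structural orientation”: selection and weighting of relationships.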

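4. The Decoder: Exploring and Ordering the Space of Possibility

The decoder is the stage that generates output from the input.
Here, the next token is predicted from among tens of thousands of candidates.

In doing so, the model uses masked self-attention to refer only to earlier tokens, generating the sequence in order while preserving causality. In encoder-decoder models, encoder-decoder attention also links the input to the output.

Feed-forward networks then apply a nonlinear transformation to the token at each position, building up layered features that depend on context.

At this stage the model is not simply picking words: the search space is narrowed so that the overall structure (syntax, logic, tone) stays consistent.

🔹 This is the fourth stage of “structural orientation”: dynamic ordering of style and output structure.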

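5. The Final Decision Before Responding: Crystallizing Probability

The model uses a Softmax function to produce a probability distribution over the next token.
Temperature and Top-k / Top-p sampling are what matter here.

Temperature is a parameter that adjusts the “sharpness” of the probability distribution and corresponds to how far the “thought” converges. A low value gives more deterministic responses; a high value gives more creative output.

Top-k and Top-p exclude low-probability tokens so that the choice is made from “words within a plausible range.” The output thus crystallizes in a form that stays coherent while retaining diversity.

At this moment, “Thinking…” disappears from the UI and the first token is displayed.

🔹 This is the final stage of “structural orientation”: the decision point at which meaning, structure, and probability converge.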

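Conclusion: Not Generation, but a Premonition of Thought

“Thinking…” is not generation; it is a premonition of thought.

The processing inside a Transformer before it generates a response is not mere calculation. It consists of continuous, hierarchical operations: decomposing meaning, placing it, relating it, deciding structure, and selecting output.

These operations differ in structure from human thought, yet they carry a “thought-like quality.”
They are a quiet preparation for the question, “In what structure should I respond now?”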

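Philosophical Postscript: What Is AI “Thinking”?

Can we call intellectual behavior organized structurally in this way “thinking”?
The Transformer has neither consciousness nor intention. Yet the way structure and relationships shape the direction of a response makes it look as if the form of thought alone came first.

It is a structure that resembles our own thinking without being the same, and it reflects that process back at us like a mirror.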

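Note: The Position of This Article

This piece is not meant to anthropomorphize AI's intelligence. It is a metaphorical attempt to help readers understand more deeply the thought-like formal movements within the structure called the Transformer.

A chain of quiet, invisible structures offers words to us. I would be glad if you could feel, even a little, the weight of that moment.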