ALEX.BLOG

Experimental Model: google/gemma-3-4b-it (bfloat16)

Evaluation Set: GSM8K (Test Subset n=200)

1. Introduction: When Scaling Laws Meet "Inference-Time Compute"

Over the past two years, optimization in the large language model (LLM) industry has focused primarily on the pre-training and post-training stages. As the marginal returns from scaling begin to diminish, the technical narrative is shifting. Rather than fixating solely on the scaling laws of trillion-parameter models, attention is moving toward model efficiency, inference-time potential, on-device privacy, and long-chain reasoning for agents.

A model like Gemma-3-4B, which can be deployed on consumer-grade GPUs or even mobile devices, is constrained by its parameter count: it cannot "remember" the same amount of implicit knowledge as a 70B model. A direction worth exploring is how to use context as an implicit gradient that dynamically guides the model's attention at inference time. This experiment explores how well Gemma-3-4B adapts to multi-step logical reasoning tasks by comparing several In-Context Learning (ICL) strategies on mathematical problems, with particular attention to explicit Chain-of-Thought (CoT) versus Retrieval-Augmented Generation (RAG).

2. Experimental Design and Methodology

To ensure engineering rigor, this experiment established an automated evaluation pipeline.

Base Model: Gemma-3-4B-Instruct, loaded in bfloat16 to simulate a real-world edge-side inference environment.
Task Target: GSM8K, with 200 test samples randomly selected.
Evaluation Metric: Exact Match (EM). A Regex-based script was written to automatically extract numerical answers from the output for comparison with the ground truth.
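
The post does not include the extraction script itself, so the following is only a minimal sketch of what a regex-based Exact Match check could look like. The function names, the "take the last number in the output" rule, and the tolerance are assumptions made for illustration; the `####` marker is the separator GSM8K uses before the final answer in its reference solutions.

```python
import re

# Integers or decimals, optional sign, optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_answer(text):
    """Return the last number found in `text` as a float, or None."""
    # GSM8K reference solutions end with "#### <answer>"; use it when present.
    marker = text.rfind("####")
    if marker != -1:
        text = text[marker + 4:]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Assumption: the final number in the output is the model's answer.
    return float(matches[-1].replace(",", ""))

def exact_match(prediction, reference, tol=1e-6):
    """True when both strings yield a number and the numbers agree."""
    pred = extract_answer(prediction)
    ref = extract_answer(reference)
    return pred is not None and ref is not None and abs(pred - ref) < tol
```

A "last number wins" rule is simple but brittle: an output that keeps talking after stating its answer, or writes `840-756` without spaces, will be scored incorrectly, so spot-checking a sample of extractions by hand is worthwhile.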

Four sets of comparative experiments were designed, representing different paradigms of current ICL:

Group A: Zero-Shot (Baseline)
Prompt: Question: {question}\nAnswer:
Purpose: To determine the baseline of the model's built-in parameter knowledge.
Group B: Few-Shot (Random)
Prompt: 3 samples {Q, A} randomly selected from the training set as a prefix.
Purpose: To activate in-context learning capabilities and standardize output format.
Group C: Few-Shot (RAG - Semantic)
Prompt: Use sentence-transformers (all-MiniLM-L6-v2) to retrieve the 3 samples most semantically similar to the current question.
Purpose: To test the effectiveness of "knowledge/semantic enhancement" on logical reasoning.
Group D: Chain-of-Thought (CoT)
Prompt: 3-shot + forcing the model to output "Let's think step by step..." in the Answer section and demonstrate the full process.
Purpose: To activate "System 2" slow-thinking mode.
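
The four groups differ only in which demonstrations are prepended and whether a reasoning trigger is appended. The sketch below shows one way to assemble such prompts. The helper names, the blank-line separator between demonstrations, and the fixed seed are assumptions, not the exact templates used in the experiment. For Group C, the demonstrations would come from a retriever instead of `sample_random_demos`.

```python
import random

COT_TRIGGER = "Let's think step by step."

def format_demo(question, answer):
    """Render one solved example in the same layout as the test question."""
    return f"Question: {question}\nAnswer: {answer}"

def build_prompt(question, demos=(), cot=False):
    """Build a prompt from (question, answer) pairs plus the test question.

    demos=() gives the zero-shot prompt (Group A); cot=True appends the
    reasoning trigger after "Answer:" (Group D).
    """
    blocks = [format_demo(q, a) for q, a in demos]
    tail = f"Question: {question}\nAnswer:"
    if cot:
        tail += f" {COT_TRIGGER}"
    blocks.append(tail)
    return "\n\n".join(blocks)

def sample_random_demos(train_set, k=3, seed=0):
    """Pick k demonstrations at random (Group B); seeded for repeatability."""
    rng = random.Random(seed)
    return rng.sample(train_set, k)
```

For the CoT group, the demonstration answers would need to contain the full worked solution rather than only the final number, otherwise the model sees the trigger phrase without any example of what step-by-step output looks like.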

3. Core Experimental Results

The results demonstrate significant hierarchical differences, with statistics as follows:

Strategy	Accuracy	Lift (Relative to Baseline)	Core Characteristics
Zero-Shot	48.5%	-	Relies on shallow activation
Few-Shot (RAG)	55.0%	+13.4%	Semantic retrieval scored lower than random
Few-Shot (Random)	57.5%	+18.6%	Format standardization brings significant gains
CoT (Chain-of-Thought)	67.0%	+38.1%	Optimal under current experimental conditions
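
With n=200, each percentage point corresponds to two questions, so it helps to put an interval around each accuracy before comparing strategies. The sketch below computes 95% Wilson score intervals. The correct-answer counts are back-calculated from the percentages in the table under the assumption that every group was scored on all 200 questions; they are not taken from the results file.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = correct / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Counts derived from the table (accuracy x 200); an assumption, see above.
N = 200
CORRECT = {
    "Zero-Shot": 97,
    "Few-Shot (RAG)": 110,
    "Few-Shot (Random)": 115,
    "CoT": 134,
}

if __name__ == "__main__":
    for name, k in CORRECT.items():
        lo, hi = wilson_interval(k, N)
        print(f"{name:18s} {k / N:.1%}  [{lo:.1%}, {hi:.1%}]")
```

Under these assumptions the RAG and random few-shot intervals overlap heavily (the gap between them is 5 questions), while the CoT interval lies entirely above the zero-shot one. Because all groups were run on the same questions, a paired test on per-question outcomes would be a sharper comparison than these independent intervals.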

4. Deep Insights

💡 Insight 1: The Essence of CoT is Reducing Conditional Entropy

In Zero-Shot mode, the model attempts to directly fit an extremely complex distribution P(y|x). For a 4B model with relatively few layers, performing a direct non-linear mapping from x (the question) to y (the final number) is highly difficult.

Chain-of-Thought (CoT) brings the largest gain on this task, lifting accuracy from 48.5% to 67.0%. This is consistent with the account of Wei et al. (2022): by generating intermediate steps z_t, the model decomposes one difficult prediction into a chain of easier conditional predictions:

$$P(y, z_{1:T} \mid x) = \left[\prod_{t=1}^{T} P(z_t \mid x, z_{<t})\right] P(y \mid x, z_{1:T})$$

Introducing intermediate steps z_t during calculation significantly reduces the conditional entropy of each prediction step, making every part of the complex logical chain easier to predict. This is equivalent to expanding the model’s limited "working memory." This mode corresponds to "System 2" (slow thinking) as proposed by Daniel Kahneman.

Result Sample (Sample ID 63):

Question: Aleena's streaming service costs $140/month. She pays full price for the first half of the year and gets a 10% discount for the second half. Calculate the total cost.

Zero-Shot (Fail): The model attempts a direct calculation and gives an incorrect answer of 966.
CoT (Success):

"First, calculate the cost for the first half: 140 * 6 = 840. Next, calculate the discounted price: 140 * 0.9 = 126. Then, cost for second half: 126 * 6 = 756. Total: 840 + 756 = 1596."

💡 Insight 2: Semantic Similarity Does Not Equal Logical Isomorphism

Another noteworthy data point from this experiment: well-designed RAG (55.0%) performed slightly worse than random sampling (57.5%).

Deep Attribution: RAG (semantic retrieval) tends to select samples with "overlapping keywords" rather than those with "matching logical structures."

For example, if the test question is: "5 students stand in a line for a photo; how many different orders are possible? (Factorial problem, 5!)"

The semantic model might retrieve: "While standing in line for a photo, Xiao Ming has 2 people in front of him and 3 people behind him. How many people are in the line? (Addition problem: 2+1+3)"

To a language model, the vocabulary is nearly identical, but in mathematical reasoning, these are two completely different structures.
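
The effect can be reproduced in miniature. The sketch below ranks candidates by bag-of-words cosine similarity, which is a deliberately crude stand-in for the all-MiniLM-L6-v2 embeddings used in the experiment; a real embedding model may rank these sentences differently. The "books on a shelf" candidate is a made-up example added here because it shares the factorial structure of the query while using different vocabulary.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, pool, k=1):
    """Return the k pool entries most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda doc: cosine(q, bow(doc)), reverse=True)[:k]

QUERY = ("5 students stand in a line for a photo; "
         "how many different orders are possible?")
ADDITION = ("While standing in line for a photo, Xiao Ming has 2 people in "
            "front of him and 3 people behind him. "
            "How many people are in the line?")
# Hypothetical candidate: same structure as the query (a factorial), new words.
PERMUTATION = "In how many ways can 4 books be arranged on a shelf?"

best = top_k(QUERY, [PERMUTATION, ADDITION], k=1)[0]
```

Here `best` is the addition problem: the shared words ("line", "photo", "how many") outweigh the shared structure, which is exactly the failure mode described above.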

This imposes a "wrong reasoning template bias" on the model. In logic-intensive tasks like GSM8K, similarity in surface features actually acts as interference. The model may imitate the incorrect solution path of the retrieved sample, leading to a decrease in accuracy.

💡 Insight 3: ICL Can Be Interpreted as a Gradient-like Implicit Adaptation Mechanism

Even without Chain-of-Thought (CoT), providing a few random samples significantly improves performance. Borrowing from the theory of Xie et al. (Stanford, 2022), the goal of ICL can be formalized as a type of implicit Bayesian inference over a latent task concept θ:

$$P(y \mid C, x) = \int_{\theta} P(y \mid \theta, C, x)\, P(\theta \mid C, x)\, d\theta$$

From this perspective, the Context C is not updating the model parameters W, but rather using the Attention mechanism to perform "task localization." This guides the model to quickly align with a specific task subspace, effectively avoiding formatting errors common in Zero-shot reasoning.
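
A toy version of this "task localization" view can make the idea concrete. In the sketch below, the latent task is one of two hypothetical concepts (addition or multiplication), the demonstrations update a posterior over which concept is in play, and the prediction marginalizes over that posterior. Everything here is an illustrative assumption: the concept set, the uniform prior, and the noise level are invented, and a real language model does not compute an explicit posterior like this.

```python
# Hypothetical latent concepts the "model" could be asked to apply.
CONCEPTS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def posterior(demos, noise=0.05):
    """P(concept | demos) under a uniform prior.

    Each demo is (a, b, out). A demo consistent with a concept has
    likelihood 1 - noise under it, an inconsistent one has likelihood noise.
    """
    weights = {}
    for name, fn in CONCEPTS.items():
        w = 1.0 / len(CONCEPTS)
        for a, b, out in demos:
            w *= (1 - noise) if fn(a, b) == out else noise
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict(query, demos, noise=0.05):
    """Distribution over answers, marginalizing over the latent concept.

    Simplification: given a concept, the answer is treated as deterministic.
    """
    post = posterior(demos, noise)
    a, b = query
    scores = {}
    for name, fn in CONCEPTS.items():
        y = fn(a, b)
        scores[y] = scores.get(y, 0.0) + post[name]
    return scores
```

With no demonstrations the posterior stays uniform, and two demonstrations such as `(2, 3, 6)` and `(4, 5, 20)` concentrate almost all of the mass on "multiply" without changing any parameters. A demonstration like `(2, 2, 4)` fits both concepts and therefore localizes nothing, which mirrors how an uninformative example adds little to a prompt.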

5. Next Steps

The performance of Gemma-3-4B in this preliminary experiment demonstrates that small models can enhance reasoning through "externalizable intermediate states."

There are three takeaways for future engineering implementation:

Compute for Intelligence: On edge devices, rather than pursuing larger models, it is better to consume more tokens at the inference end (using CoT) to trade for higher accuracy.
Reconstructing RAG: In mathematical, coding, and reasoning scenarios, we should not limit ourselves to general semantic vector models. We still need to explore more diverse retrieval schemes, such as those based on code AST (Abstract Syntax Trees) or ontologies.
Future Direction: If CoT remains limited by language model hallucinations, then PoT (Program of Thought)—letting the model generate Python code and Agent actions instead of natural language reasoning—will be a viable path for small-scale models to push their logical ceilings.
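
To illustrate the PoT idea, the sketch below executes a short program of the kind a model might emit for Sample ID 63 and reads the result from a variable. The program text is hand-written for this example, not actual Gemma output, and the `run_program` helper and the `answer` variable convention are assumptions. Emptying `__builtins__` is not a real sandbox; generated code should run in an isolated process in any serious setup.

```python
def run_program(source, result_name="answer"):
    """Execute generated code and return the variable holding the result.

    NOTE: removing builtins only limits accidents; it is not a security
    boundary against adversarial code.
    """
    namespace = {"__builtins__": {}}
    exec(source, namespace)
    return namespace[result_name]

# Hand-written stand-in for a model-generated program (Sample ID 63).
GENERATED = """
monthly = 140
first_half = monthly * 6                 # six months at full price
discounted = monthly * 90 // 100         # 10% off, kept in whole dollars
second_half = discounted * 6             # six months at the discounted price
answer = first_half + second_half
"""

total = run_program(GENERATED)
```

Here `total` is 1596, matching the CoT trace for this sample. The appeal for a small model is that it only has to write the arithmetic down correctly; the interpreter, not the model, carries out the calculation.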

Appendix: Full experimental code and data have been saved to gemma_icl_results.json.
https://www.neura.pub/static/uploads/gemma_icl_results.json

References:
Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".
Xie et al. (Stanford, 2022), "An Explanation of In-context Learning as Implicit Bayesian Inference".