fix(strategy/tot): use ast.literal_eval for model-generated thoughts by Jiangrong-W · Pull Request #2069 · FoundationAgents/MetaGPT

Jiangrong-W · 2026-06-17T09:56:44Z

Features

Harden the Tree-of-Thoughts solver so model-generated output is parsed as data instead of being executed as code.

Root cause. ThoughtSolverBase.generate_thoughts (metagpt/strategy/tot.py) took the model's generated thought block and parsed it with eval() after only stripping the markdown code fence:
- rsp = await self.llm.aask(...): raw model output;
- thoughts = CodeParser.parse_code(text=rsp): regex-extracts the fenced block and returns it verbatim, with no literal/AST validation;
- thoughts = eval(thoughts): the sink (metagpt/strategy/tot.py:66 on main).
Because eval() evaluates arbitrary Python, any non-literal expression in the model output (rather than the expected list of nodes) would run in the host process. No sandbox, literal check, or try/except guarded the call.
Reachability. This path is on by default in both shipped solvers: it is reached from BFSSolver.generate_and_evaluate_nodes and DFSSolver._dfs, so every Tree-of-Thoughts run routes raw LLM output into the sink.
Why parsing, not execution, is intended. The prompt (OUTPUT_FORMAT, metagpt/strategy/tot.py:20) instructs the model to return "strictly a list of nodes, in json format" — i.e. a literal data structure. The eval() was being used purely as a parser, so evaluating code was never the intended behaviour.
Change. Replace eval() with ast.literal_eval(), which only evaluates literals (lists/dicts/strings/numbers) and raises on anything else. The call is wrapped in try/except (ValueError, SyntaxError): on a parse failure we log a warning and fall back to an empty thought list (thoughts = []), which ThoughtTree.update_node already accepts as its default argument (update_node(thought: List[dict] = [])). Well-formed output is unchanged; malformed or unexpected output degrades safely instead of executing or raising an unhandled error. Standard library only (ast); no new dependency.
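The parse-and-fall-back step described in the Change paragraph can be sketched as follows. This is a minimal illustration, not the PR's diff: the helper name parse_thoughts is hypothetical (the PR makes the change inline in generate_thoughts), and the standard logging module stands in for MetaGPT's own logger.

```python
import ast
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def parse_thoughts(code_block: str) -> List[Any]:
    """Hypothetical helper: treat a model-generated thought block as data only."""
    try:
        # literal_eval accepts Python literals and raises on calls, names, attribute access, etc.
        return ast.literal_eval(code_block)
    except (ValueError, SyntaxError) as exc:
        # Mirror the behaviour described in the PR: warn and fall back to an empty list.
        logger.warning("Could not parse model thoughts as a literal: %s", exc)
        return []


well_formed = parse_thoughts('[{"node_id": "1", "node_state_instruction": "try 2 + 3"}]')
hostile = parse_thoughts("[__import__('os').getcwd()]")
garbled = parse_thoughts("not valid python (")
```

Here well_formed is the parsed list, while hostile and garbled both come back as an empty list. The Python documentation notes that ast.literal_eval can also raise TypeError, MemoryError, or RecursionError on some malformed inputs; the sketch follows the PR description in catching only ValueError and SyntaxError.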

Feature Docs

Not applicable — internal hardening of an existing code path. No public API or signature changes, no configuration changes.

Influence

Behaviour-preserving for valid model output: JSON / Python list literals of thought nodes parse exactly as before (ast.literal_eval accepts the literal shape the prompt requests).
Removes a code-execution path reachable through both shipped Tree-of-Thoughts solvers (BFSSolver, DFSSolver).
No public API / signature change; standard-library only (ast). Scope is limited to the single parse call in generate_thoughts.
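One edge of the "JSON / Python list literals" claim is worth illustrating: ast.literal_eval parses Python literals, so JSON output that uses true, false, or null is rejected. eval() would also have failed on such input (with NameError), so this is not a regression; after the change it degrades to the empty-list fallback instead of raising. The payload and its keys below are illustrative assumptions, not the PR's test data.

```python
import ast
import json

# Illustrative payload: valid JSON, but `true` is not a Python literal.
json_only_payload = '[{"node_id": "1", "done": true}]'
# Illustrative payload that is both valid JSON and a valid Python literal.
shared_payload = '[{"node_id": "1", "done": "yes"}]'


def try_literal_eval(text: str):
    """Return the parsed literal, or None if literal_eval rejects the text."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


json_only_via_literal_eval = try_literal_eval(json_only_payload)  # rejected
json_only_via_json = json.loads(json_only_payload)                # parsed
shared_via_literal_eval = try_literal_eval(shared_payload)        # parsed
```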

Result

Added tests/metagpt/strategy/test_tot_generate_thoughts_eval.py (stub LLM, no network):

test_generate_thoughts_parses_benign_list — a fenced JSON list of thought nodes is parsed into ThoughtNodes (behaviour preserved).
test_generate_thoughts_does_not_execute_model_code — when the model returns a non-literal payload, no side effect occurs (a sentinel file is asserted absent). This fails on the pre-fix eval() path and passes after the fix.
test_literal_eval_rejects_code — guards the underlying primitive.

$ pytest tests/metagpt/strategy/test_tot_generate_thoughts_eval.py
3 passed

Reverting only metagpt/strategy/tot.py reproduces the prior behaviour:

E       AssertionError: model-supplied code was executed (eval sink still live)
1 failed, 2 passed

Lint on the changed files is clean (ruff, black --line-length 120, isort --profile black).
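The sentinel-file check used by test_generate_thoughts_does_not_execute_model_code can be sketched without MetaGPT or a stub LLM. This is a standalone illustration of the idea, not the PR's test code: the helper name sentinel_created and the payload are assumptions, and the payload only creates an empty file inside a temporary directory.

```python
import ast
import os
import tempfile


def sentinel_created(parser) -> bool:
    """Return True if parsing the payload with `parser` created the sentinel file."""
    with tempfile.TemporaryDirectory() as tmp:
        sentinel = os.path.join(tmp, "sentinel")
        # A list whose only element is a call expression rather than a literal.
        payload = f"[open({sentinel!r}, 'w').close()]"
        try:
            parser(payload)
        except (ValueError, SyntaxError):
            pass  # literal-only parsing rejects the call expression
        # Check before the temporary directory is cleaned up.
        return os.path.exists(sentinel)


executed_by_eval = sentinel_created(eval)                      # side effect occurs
executed_by_literal_eval = sentinel_created(ast.literal_eval)  # no side effect
```

With eval as the parser the file appears, which corresponds to the failing assertion shown for the reverted tot.py; with ast.literal_eval the payload is rejected before anything runs.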

Other

Minimal, backward-compatible change; commit is DCO signed-off (Signed-off-by).
Code and documentation are in English, consistent with the repository.

ThoughtSolverBase.generate_thoughts parsed the model's generated thought block with eval() after only a cosmetic markdown-fence strip. Because the Tree-of-Thoughts solver feeds raw model output (rsp -> CodeParser.parse_code) into eval(), a model that emits non-literal Python (e.g. a list whose element calls __import__('os').system(...)) executes arbitrary code on the host. No sandbox, try/except, or literal check guarded the call.

The model is instructed (OUTPUT_FORMAT) to return a plain JSON/Python list of thought nodes, so well-formed output is always a literal data structure. Replace eval() with ast.literal_eval(), which only evaluates literals and raises on code. Parse failures are logged and degrade to an empty thought list (matching update_node's default), preserving behaviour for well-formed output.

Adds a regression test that fails on the previous eval() path (model-supplied code executes) and passes with literal-only parsing, plus a benign-output test confirming valid thought lists still parse.

The regression test injects a mocked LLM, but the session-wide autouse llm_mock fixture (tests/conftest.py) still builds a real OpenAILLM for every test, which validates config.llm.proxy via httpx. An earlier-collected module assigns a non-URL object onto the shared config.llm instance and never restores it, so that leaked proxy made llm_mock raise "Proxy protocol must be ..." at setup of these tests. A module-scoped autouse fixture restores config.llm.proxy to a sane value before llm_mock runs and puts it back afterwards, isolating these tests without affecting any other module.

Signed-off-by: christop <825583681@qq.com>
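The save-and-restore pattern behind the proxy fixture in the commit message can be sketched with a plain context manager. This is an illustration under stated assumptions, not the PR's fixture: restored_attr is a hypothetical helper, SimpleNamespace stands in for the shared config.llm object, and the empty string stands in for whatever "sane value" the fixture actually uses. A pytest fixture written with yield has the same set, yield, restore shape.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def restored_attr(obj, name, safe_value):
    """Temporarily set obj.<name> to safe_value, then put the previous value back."""
    original = getattr(obj, name)
    setattr(obj, name, safe_value)
    try:
        yield obj
    finally:
        # Restore even if the body raises, so other modules see the value they had before.
        setattr(obj, name, original)


# Stand-in for a shared config whose proxy was overwritten by an earlier test module.
leaked_value = object()
fake_llm_config = SimpleNamespace(proxy=leaked_value)

with restored_attr(fake_llm_config, "proxy", "") as cfg:
    proxy_inside = cfg.proxy  # the safe value is visible while the tests run

proxy_after = fake_llm_config.proxy  # the earlier value is back afterwards
```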

Jiangrong-W requested a deployment to unittest June 17, 2026 09:56 — with GitHub Actions Waiting

Jiangrong-W force-pushed the harness-fix/metagpt-eval-exec-tot-generate-thoughts-eval branch from 6f3507d to edeb873 Compare June 17, 2026 10:21

Jiangrong-W requested a deployment to unittest June 17, 2026 10:21 — with GitHub Actions Waiting

Jiangrong-W force-pushed the harness-fix/metagpt-eval-exec-tot-generate-thoughts-eval branch from edeb873 to d3da0dc Compare June 17, 2026 11:06

Jiangrong-W requested a deployment to unittest June 17, 2026 11:06 — with GitHub Actions Waiting

Jiangrong-W wants to merge 1 commit into FoundationAgents:main from Jiangrong-W:harness-fix/metagpt-eval-exec-tot-generate-thoughts-eval
