
ChatGPT, Claude, Perplexity for Procurement: Where They Hallucinate
Stress test of ChatGPT, Claude, and Perplexity on real procurement tasks. Where each works (narrative, drafting, web research) and where each fails (cross-document spec extraction, unit normalization, page-level citations).
Priya Sharma
Procurement Technology Lead, SpecLens
Key takeaways
- ChatGPT, Claude, and Perplexity all hallucinate predictably on the same procurement-specific task: cross-document spec extraction with page-level citations and unit normalization.
- Three failure categories matter most for procurement: extraction errors (wrong values), citation hallucinations (fabricated page references), and calculation errors (wrong arithmetic on extracted inputs).
- Use ChatGPT and Claude for narrative tasks (summary, RFP draft); use Perplexity for web-grounded pre-RFP research; use specification intelligence for evaluation tasks (multi-vendor comparison, gap analysis, decision-meeting matrix).
- Claude flags uncertainty more often than ChatGPT — a meaningful procurement-audit advantage — but still lacks page-level citation infrastructure that procurement audits require.
- The hallucination floor is structural: general-purpose AI is built to produce plausible text; specification intelligence is built to produce verifiable extraction with citations. Different categories of tool, different failure modes.
What Happens When You Test the Big Three on Real Procurement Documents
Procurement teams have started running ChatGPT, Claude, and Perplexity on vendor proposals, RFP responses, and spec sheets. The results are uneven. Each model handles some procurement tasks well and others badly — and the boundary between "use it" and "don't use it" is not where the marketing leads you to believe. Glean's document-understanding analysis observed that maintaining context across multiple files is where general-purpose AI breaks down; Evolution AI's testing reports roughly one hallucination per page on document-comparison tasks at scale.
This is the procurement-specific stress test for the three leading general-purpose AI tools — what they get right, where they hallucinate, and the exact category of procurement task each handles best in 2026.
Quick Answer: Where Each Model Works and Where It Fails
ChatGPT handles single-document narrative summaries reliably but hallucinates when asked to compare across multiple documents, especially on numeric values and citations. Claude is the most cautious of the three on direct numeric extraction and openly flags uncertainty, but it still struggles with cross-document normalization. Perplexity is strongest on web-grounded research (analyst reports, vendor positioning) with citations to public sources, but it is not a document-extraction tool. None of the three carries page-level citations to user-uploaded vendor PDFs at procurement-audit grade. For multi-vendor comparison with citations, normalization, and exportable matrices, the right tool category is purpose-built specification intelligence.
The Three Categories of AI Failure That Matter for Procurement
Before testing any model, define what failure looks like. Three categories of error matter most for procurement use:
1. Extraction errors. The model reads a number from a vendor PDF and reports a different number. Cause: OCR errors on scanned PDFs, table-parsing errors on multi-column layouts, footnote misattribution. Failure mode: matrix cells are wrong; subsequent decisions rest on incorrect data.
2. Citation hallucinations. The model reports a number with a fabricated citation — a page reference that doesn't exist in the source document, or a quote that doesn't appear in the source. Failure mode: the audit trail is fictional; a stakeholder challenge collapses the citation.
3. Calculation errors. The model performs math on extracted values and produces an arithmetic result that is wrong. Common in TCO calculations, unit conversions, and weighted-score computations. Failure mode: matrix totals are wrong; vendor rankings flip silently.
Procurement tooling has to fail safely on all three. General-purpose AI fails commonly on all three; purpose-built specification intelligence fails rarely or never on the first two and surfaces uncertainty explicitly on the third.
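Failing safely on the second category is mechanically checkable. Here is a minimal sketch, assuming pypdf for PDF text extraction; the `ExtractedSpec` shape, file name, and values are illustrative, not any specific tool's schema:

```python
# pip install pypdf  (assumption: any PDF-to-text library works here)
from dataclasses import dataclass
from pypdf import PdfReader

@dataclass
class ExtractedSpec:
    value: str   # e.g. "100,000 IOPS"
    page: int    # 1-indexed page the model cited
    quote: str   # verbatim snippet the model claims the value came from

def verify_citation(pdf_path: str, spec: ExtractedSpec) -> bool:
    """True only if the cited page exists and actually contains the quote."""
    reader = PdfReader(pdf_path)
    if not 1 <= spec.page <= len(reader.pages):
        return False  # fabricated page reference (failure category 2)
    page_text = reader.pages[spec.page - 1].extract_text() or ""
    # Collapse whitespace so PDF line breaks don't cause false negatives.
    squash = lambda s: " ".join(s.split())
    return squash(spec.quote) in squash(page_text)

spec = ExtractedSpec("100,000 IOPS", page=14,
                     quote="sustained 100,000 IOPS at 4K block size")
if not verify_citation("vendor_a_proposal.pdf", spec):
    print(f"FLAG for human review: cannot trace {spec.value!r} to page {spec.page}")
```

Values that fail the check go to a human, not into the matrix.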
ChatGPT on Procurement Tasks — Where It Works
ChatGPT (OpenAI's flagship model line) handles three procurement-context tasks well in 2026:
- Single-document narrative summary. Upload a single vendor proposal; ask for a 200-word executive summary of approach, methodology, and team. ChatGPT produces a coherent, qualitatively useful summary that procurement can use as input to evaluation.
- RFP draft generation. Given a category and an objective, ChatGPT drafts a defensible RFP structure that procurement edits to fit the project. Saves 2-4 hours of drafting time per RFP.
- Vendor research synthesis. Given a vendor name, ChatGPT synthesizes publicly available positioning information from its training data — useful for pre-RFP qualification work.
See ChatGPT for procurement for the broader use-case mapping.
ChatGPT on Procurement Tasks — Where It Fails
ChatGPT consistently fails on three procurement-context tasks:
1. Multi-document spec extraction. Upload three vendor PDFs and ask "extract all IOPS values from each." ChatGPT returns plausible-looking values, several of which are extraction errors (numbers that don't appear in the document) or citation hallucinations (page references that don't exist). Evolution AI's testing puts the rate at roughly one error per page on document-comparison tasks; across three 50-page vendor proposals, that is on the order of 150 errors per evaluation cycle.
2. Cross-vendor normalization. Ask "is vendor A's 100,000 IOPS comparable to vendor B's 100,000 IOPS?" ChatGPT can describe what the question means but cannot reliably answer it from the source documents. Conditions, block sizes, and queue-depth assumptions buried in vendor footnotes are routinely missed; the sketch after this list shows how much a single footnoted block size can change the real comparison.
3. TCO calculation with hallucinated inputs. Ask ChatGPT to compute a 5-year TCO from extracted vendor data. The arithmetic is sometimes correct; the input values are sometimes hallucinated. The output looks defensible until a stakeholder challenges a specific input number.
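A toy illustration of why item 2 fails silently. The conversion itself is standard (throughput = IOPS × block size); the block sizes are invented footnote conditions for the example:

```python
def iops_to_mb_per_s(iops: float, block_size_kib: float) -> float:
    """Sustained throughput implied by an IOPS figure at a given block size."""
    return iops * block_size_kib * 1024 / 1_000_000

# Two vendors quoting the identical headline number under different
# (footnoted, hypothetical) test conditions:
vendor_a = iops_to_mb_per_s(100_000, block_size_kib=4)   # ~410 MB/s
vendor_b = iops_to_mb_per_s(100_000, block_size_kib=64)  # ~6,554 MB/s

print(f"Vendor A: {vendor_a:,.0f} MB/s  Vendor B: {vendor_b:,.0f} MB/s")
# Same "100,000 IOPS" headline; a 16x difference in actual throughput.
```

A model that compares the headline numbers without surfacing the block-size footnote produces a confidently wrong equivalence.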
Claude on Procurement Tasks — Where It Works
Anthropic's Claude (the model family this post was drafted with) handles three procurement-context tasks well:
- Long-context document analysis. Claude's long-context capability handles 200-page vendor proposals coherently. For single-document deep analysis, the long-context advantage matters.
- Uncertainty acknowledgment. Claude is the most likely of the three to flag "I don't see this value in the source document" or "the document doesn't specify the measurement condition" rather than fabricating a confident answer. Procurement audits favor uncertainty acknowledgment over false confidence.
- Long-form RFP drafting. Like ChatGPT, Claude drafts coherent RFPs from a category and objective; the long-context advantage helps for complex multi-section RFPs.
Claude on Procurement Tasks — Where It Fails
Claude shares the cross-document normalization failure mode with ChatGPT. The structural problem is the same: general-purpose AI does not have purpose-built normalization logic for procurement-specific spec types (IOPS at different conditions, MRI field strength vs gradient performance, range under different payload assumptions).
Claude also lacks the citation infrastructure procurement audits require. While Claude is more likely than ChatGPT to acknowledge uncertainty, the values it does produce do not carry click-through citations to the source document with page references — the procurement-audit standard. The ChatGPT vs Claude vs Copilot for procurement comparison covers the broader tradeoff.
Perplexity on Procurement Tasks — Where It Works
Perplexity is the strongest of the three on web-grounded research with public-source citations:
- Vendor positioning research. Ask "summarize the current 2026 positioning of Dell PowerStore Gen 2" and Perplexity returns a synthesis with citations to Dell's product pages, IDC tracker reports, and analyst summaries.
- Industry trend research. "What did the Hackett 2026 Procurement Key Issues Study report on AI deployment?" Perplexity returns a citation-backed synthesis from public sources.
- Pre-RFP qualification. Synthesize public information about vendor reputation, recent funding, customer wins, and product roadmap.
For pre-RFP research, Perplexity is a useful tool. For evaluation of vendor responses, it is not.
Perplexity on Procurement Tasks — Where It Fails
Perplexity is not built for user-uploaded document analysis. It is web-grounded; the citations are to public web sources. For internal procurement documents (vendor proposals, RFP responses, security questionnaires) that should never be exposed to web search, Perplexity is the wrong category of tool.
Perplexity also does not produce structured comparison matrices, run gap analysis against an RFP baseline, or export to Excel/PDF/PowerPoint. It is a research synthesis tool, not a procurement evaluation platform.
The Side-by-Side Stress Test
| Task | ChatGPT | Claude | Perplexity | SpecLens |
|---|---|---|---|---|
| Single-doc narrative summary | Reliable | Reliable | Limited (web-grounded only) | Reliable |
| Multi-doc spec extraction with citations | Hallucinates | Hallucinates less, no citations | Not designed for it | Reliable with page citations |
| Cross-vendor unit normalization | Fails silently | Fails silently | Not designed for it | Automated |
| Gap analysis vs RFP baseline | Manual prompt engineering | Manual prompt engineering | Not designed for it | Automated |
| Vendor positioning research (web) | Reasonable from training data | Reasonable from training data | Strongest | Not its function |
| RFP draft generation | Reliable | Reliable, long-context strong | Limited | Not its function |
| Excel/PDF/PowerPoint export with citations | No | No | No | Yes |
| Procurement audit-grade trail | No | No | Public-web only | Yes |
The Pattern That Recurs Across All Three Models
General-purpose AI fails predictably at the same procurement-specific task: cross-document spec extraction with page-level citations and unit normalization. None of the three was built to do this well; the failures are structural, not transient.
The reason: general-purpose AI is trained on web text and document text broadly, with no procurement-specific structure imposed on the output. Specification intelligence platforms are built around the procurement-specific structure — page-level citations, confidence scoring, unit normalization, gap analysis against an RFP baseline, exportable matrix output. Different category of tool.
The right way to think about it: general-purpose AI is the right tool for procurement research tasks (vendor positioning, industry trend synthesis, RFP drafting). Specification intelligence is the right tool for procurement evaluation tasks (multi-vendor spec comparison, gap analysis, decision-meeting matrix). Mature procurement teams use both, for different parts of the workflow.
How to Use General-Purpose AI in Procurement Without Getting Burned
Five rules drawn from the testing pattern:
- Use ChatGPT and Claude for narrative tasks; never for citations. Single-document summary, RFP draft, qualitative analysis — fine. Spec extraction with page citation — never.
- Use Perplexity for pre-RFP research only. Vendor positioning, analyst report synthesis, industry trend research with public-web citations. Not for evaluating vendor responses.
- Verify every quantitative claim against the source. If you do extract numbers from a vendor PDF using ChatGPT or Claude, spot-check at least 10% of values against the source document before any number reaches a decision matrix (a minimal sampling sketch follows this list).
- Don't paste confidential procurement documents into consumer AI services. Vendor pricing and proprietary specifications should not flow through services that may retain or train on the input. Confirm enterprise data-handling controls before any vendor document goes through general-purpose AI.
- Use specification intelligence for the comparison-matrix step. The matrix step is where general-purpose AI fails predictably; purpose-built tooling is the right substitute.
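A minimal sketch of the 10% spot-check from the third rule, assuming extracted values arrive as (value, page) pairs and source pages are already available as plain text (via pypdf or any extractor); all names and data here are illustrative:

```python
import math
import random

def spot_check(extracted: list[dict], page_texts: dict[int, str],
               sample_rate: float = 0.10, seed: int = 0) -> list[dict]:
    """Sample extracted values and return those not found on their cited page.
    Any non-empty result means: stop and verify the full set by hand."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    k = max(1, math.ceil(len(extracted) * sample_rate))
    return [item for item in rng.sample(extracted, k)
            if item["value"] not in page_texts.get(item["page"], "")]

# Hypothetical model output and source text:
extracted = [{"value": "100,000", "page": 14}, {"value": "99.999%", "page": 22}]
pages = {14: "sustained 100,000 IOPS at 4K block size",
         22: "uptime SLA of 99.99% measured monthly"}
print(spot_check(extracted, pages, sample_rate=1.0))
# -> [{'value': '99.999%', 'page': 22}]  (the model added a nine)
```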
The Hallucination Floor — and Why It's Structural
Why do general-purpose models hallucinate spec values? Because they were trained to produce plausible-looking text, not to extract verifiable structured data. When the source document does not clearly contain the requested value — or when the request asks for cross-document comparison the model can't reliably perform — the model produces a confident-looking answer rather than an "I don't know." Claude is best of the three at flagging uncertainty; ChatGPT and Perplexity less so.
The structural fix is to use models that are constrained to verifiable extraction with citations. Specification intelligence platforms enforce this constraint: every value extracted must trace to a specific page in a specific document; values that fail the trace are flagged rather than fabricated. The hallucination floor on a constrained extraction task is meaningfully lower than on free-form text generation.
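One way to picture that constraint, as a sketch (the types and names are illustrative, not SpecLens's actual schema): the extraction step may only return a value together with a passing trace, or an explicit flag. There is deliberately no code path that returns an untraced number.

```python
from dataclasses import dataclass

@dataclass
class Traced:
    value: str
    page: int
    quote: str   # verbatim source text the value was read from

@dataclass
class Flagged:
    requested: str
    reason: str  # surfaced to the user instead of a guess

def constrained_extract(requested: str, page_texts: dict[int, str],
                        candidate: Traced | None) -> Traced | Flagged:
    """Admit a value only with a verified trace; otherwise flag, never fabricate."""
    if candidate is None:
        return Flagged(requested, "value not found in document")
    if candidate.quote not in page_texts.get(candidate.page, ""):
        return Flagged(requested, f"quote not traceable to page {candidate.page}")
    return candidate

pages = {14: "Array delivers sustained 100,000 IOPS at 4K block size."}
print(constrained_extract("IOPS", pages, Traced("100,000", 14, "sustained 100,000 IOPS")))
print(constrained_extract("latency", pages, None))  # -> Flagged, not a made-up number
```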
For the broader category framing, see what is specification intelligence; for the comparison playbook, see how to compare vendor proposals with AI.
Run the Test on Your Own Procurement Documents
Stress-test the AI tool stack on your real procurement documents before relying on any tool for evaluation work. Pick three vendor PDFs from a recent RFP, run them through ChatGPT, Claude, and SpecLens, and compare the outputs against the source documents yourself. The pattern is consistent across teams: general-purpose AI works for research, specification intelligence works for evaluation. Pair with the ChatGPT vs Claude vs Copilot comparison for the use-case mapping and the specification intelligence pillar for the category framing.
References
- 1. Glean, "Which AI Models Excel at Document Understanding" (2025). Document-understanding analysis on multi-file context limits.
- 2. Evolution AI, "Use ChatGPT to Compare Documents" (2025). ChatGPT PDF-comparison hallucination rate.