Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

How far can Large Language Models (LLMs) go in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely DeepSeek-R1, DeepSeek-V3, and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs' internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs' reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.

Original language: English
Title of host publication: Proceedings - 2025 IEEE International Conference on Artificial Intelligence Testing, AITest 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 168-177
Number of pages: 10
ISBN (Electronic): 9798331589134
DOIs
Publication status: Published - 2025
Event: 7th IEEE International Conference on Artificial Intelligence Testing, AITest 2025 - Tucson, United States
Duration: 21 Jul 2025 - 24 Jul 2025

Publication series

Name: Proceedings - 2025 IEEE International Conference on Artificial Intelligence Testing, AITest 2025

Conference

Conference: 7th IEEE International Conference on Artificial Intelligence Testing, AITest 2025
Country/Territory: United States
City: Tucson
Period: 21/07/25 - 24/07/25

Keywords

  • Chain-of-Thought (CoT)
  • Deep Reasoning
  • DeepSeek
  • Large Language Models (LLMs)
  • Relational Reasoning

