TY - GEN
T1 - Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons
AU - So, Chi Chiu
AU - Sun, Yueyue
AU - Wang, Jun Min
AU - Yung, Siu Pang
AU - Loh, Anthony Wai Keung
AU - Chau, Chun Pong
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - How far can Large Language Models (LLMs) go in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely, DeepSeek-R1, DeepSeek-V3 and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs' internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs' reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.
AB - How far can Large Language Models (LLMs) go in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely, DeepSeek-R1, DeepSeek-V3 and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs' internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs' reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.
KW - Chain-of-Thought (CoT)
KW - Deep Reasoning
KW - DeepSeek
KW - Large Language Models (LLMs)
KW - Relational Reasoning
UR - https://www.scopus.com/pages/publications/105016257924
UR - https://www.mendeley.com/catalogue/a444e5b2-db0d-3369-a420-14a1dded503a/
U2 - 10.1109/AITest66680.2025.00028
DO - 10.1109/AITest66680.2025.00028
M3 - Conference contribution
AN - SCOPUS:105016257924
T3 - Proceedings - 2025 IEEE International Conference on Artificial Intelligence Testing, AITest 2025
SP - 168
EP - 177
BT - Proceedings - 2025 IEEE International Conference on Artificial Intelligence Testing, AITest 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th IEEE International Conference on Artificial Intelligence Testing, AITest 2025
Y2 - 21 July 2025 through 24 July 2025
ER -