Exploring the role of large language models in the scientific method: from hypothesis to discovery

  • Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).

  • Alkhateeb, A. Can Scientific Discovery Be Automated? The Atlantic (2017).

  • Jain, M. et al. GFlowNets for AI-driven scientific discovery. Digit. Discov. 2, 557–577 (2023).

  • Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. NPJ Syst. Biol. Appl. 7, 29 (2021).

  • Cornelio, C. et al. Combining data and theory for derivable scientific discovery with AI-Descartes. Nat. Commun. 14, 1–10 (2023).

  • Kim, S. et al. Integration of neural network-based symbolic regression in deep learning for scientific discovery. IEEE Trans. Neural Netw. Learn Syst. 32, 4166–4177 (2021).

  • Gil, Y., Greaves, M., Hendler, J. & Hirsh, H. Amplify scientific discovery with artificial intelligence: many human activities are a bottleneck in progress. Science 346, 171–172 (2014).

  • Kitano, H. Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery. AI Mag. 37, 39–49 (2016).

  • Berens, P., Cranmer, K., Lawrence, N. D., von Luxburg, U. & Montgomery, J. AI for science: an emerging agenda. arXiv preprint arXiv:2303.04217 (2023).

  • Li, Z., Ji, J. & Zhang, Y. From Kepler to Newton: explainable AI for science. arXiv preprint arXiv:2111.12210 (2021).

  • Baker, N. et al. Workshop report on basic research needs for scientific machine learning: core technologies for artificial intelligence. (2019).

  • Manta, C. D., Hu, E. & Bengio, Y. GFlowNets for causal discovery: an overview. OpenReview (ICML, 2023).

  • Vinuesa, R., Brunton, S. L. & McKeon, B. J. The transformative potential of machine learning for experiments in fluid mechanics. Nat. Rev. Phys. 5, 536–545 (2023).

  • del Rosario, Z. & del Rosario, M. Synthesizing domain science with machine learning. Nat. Comput. Sci. 2, 779–780 (2022).

  • Krenn, M., Landgraf, J., Foesel, T. & Marquardt, F. Artificial intelligence and machine learning for quantum technologies. Phys. Rev. A 107, 010101 (2023).

  • van der Schaar, M. et al. How artificial intelligence and machine learning can help healthcare systems respond to COVID-19. Mach. Learn. 110, 1–14 (2021).

  • Zhang, T. et al. AI for global climate cooperation: modeling global climate negotiations, agreements, and long-term cooperation in RICE-N. arXiv preprint arXiv:2208.07004 (2022).

  • Pion-Tonachini, L. et al. Learning from learning machines: a new generation of AI technology to meet the needs of science. (2021).

  • Keith, J. A. et al. Combining machine learning and computational chemistry for predictive insights into chemical systems. Chem. Rev. 121, 9816–9872 (2021).

  • Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 5, 277–280 (2023).

  • Georgescu, I. How machines could teach physicists new scientific concepts. Nat. Rev. Phys. 4, 736–738 (2022).

  • Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annu. Rev. Phys. Chem. (2020).

  • Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  • Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

  • Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. (2022).

  • Topol, E. J. Welcoming new guidelines for AI clinical research. Nat. Med. 26, 1318–1320 (2020).

  • Topol, E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. (Hachette UK, 2019).

  • Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. (2022).

  • Stoyanovich, J., Van Bavel, J. J. & West, T. V. The imperative of interpretable machines. Nat. Mach. Intell. 2, 197–199 (2020).

  • Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit. Med. 6, 120 (2023).

  • Willcox, K. E., Ghattas, O. & Heimbach, P. The imperative of physics-based modeling and inverse theory in computational science. Nat. Comput. Sci. 1, 166–168 (2021).

  • Webster, P. Six ways large language models are changing healthcare. Nat. Med. 29, 2969–2971 (2023).

  • Eriksen, A. V., Möller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 1 (2023).

  • Ishmam, M. F., Shovon, M. S. H. & Dey, N. From image to language: a critical analysis of visual question answering (VQA) approaches, challenges, and opportunities. Inf. Fusion 106, 102270 (2024).

  • Li, C. et al. Multimodal foundation models: from specialists to general-purpose assistants. Found. Trends Comput. Graph. Vis. 16, 1–214 (2024).

  • Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024).

  • Raghu, M. & Schmidt, E. A survey of deep learning for scientific discovery. arXiv preprint arXiv:2003.11755 (2020).

  • Song, S. et al. DeepSpeed4Science initiative: enabling large-scale scientific discovery through sophisticated AI system technologies. In NeurIPS 2023 AI for Science Workshop (2023).

  • McCarthy, J., Minsky, M. L., Rochester, N. & Shannon, C. E. A proposal for the Dartmouth summer research project on artificial intelligence (1956).

  • Ramos, M. C., Collison, C. J. & White, A. D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 16, 2514–2572 (2025).

  • Zenil, H. & King, R. Artificial intelligence in scientific discovery: Challenges and opportunities. In Science: Challenges, Opportunities and the Future of Research (OECD Publishing, 2023).

  • Zenil, H. & King, R. A framework for evaluating the AI-driven automation of science. In Science: Challenges, Opportunities and the Future of Research (OECD Publishing, 2023).

  • Zenil, H. & King, R. The Far Future of AI in Scientific Discovery. In AI For Science (eds Choudhary, A., Fox, G. & Hey, T.) (World Scientific, 2023).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process Syst. 35 (2022).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process Syst. 33, 9459–9474 (2020).

  • White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023).

  • Schulhoff, S. et al. The prompt report: a systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608 (2025).

  • Fulford, I. & Ng, A. ChatGPT prompt engineering for developers. DeepLearning.AI (2023).

  • Cheng, Y. et al. Exploring large language model based intelligent agents: definitions, methods, and prospects. arXiv preprint arXiv:2401.03428 (2024).

  • Yang, H., Yue, S. & He, Y. Auto-GPT for online decision making: benchmarks and additional opinions. arXiv preprint arXiv:2306.02224 (2023).

  • Nakajima, Y. yoheinakajima/babyagi. GitHub https://github.com/yoheinakajima/babyagi (2024).

  • Chase, H. LangChain. (2022).

  • Liu, J. LlamaIndex (2022).

  • Khattab, O. et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Yuksekgonul, M. et al. TextGrad: automatic ‘differentiation’ via text. (2024).

  • DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. (2025).

  • Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. (2024).

  • Li, Z.-Z. et al. From system 1 to system 2: a survey of reasoning large language models. (2025).

  • Caufield, J. H. et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, btae104 (2024).

  • Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 120, e2305016120 (2023).

  • Liu, Y. et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 2511–2522 (2023).

  • Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 3, 1–23 (2021).

  • Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn Sci. Technol. 3, 015022 (2022).

  • Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

  • Gottweis, J. et al. Towards an AI co-scientist. (2025).

  • Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  • M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

  • Qu, Y. et al. CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments. bioRxiv (2024).

  • Roohani, Y. H. et al. BioDiscoveryAgent: an AI agent for designing genetic perturbation experiments. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • OpenAI et al. OpenAI o1 System Card (2024).

  • OpenAI. Learning to Reason with LLMs. (2024).

  • Kaplan, J. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).

  • Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. (2022).

  • Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 386, eado9336 (2024).

  • Brixi, G. et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv (2025).

  • Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).

  • Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. (2020).

  • Birk, J., Hallin, A. & Kasieczka, G. OmniJet-α: the first cross-task foundation model for particle physics. Mach. Learn Sci. Technol. 5, 035031 (2024).

  • McCabe, M. et al. Multiple physics pretraining for spatiotemporal surrogate models. In Proc. 38th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS) (2024).

  • Shojaee, P., Meidani, K., Gupta, S., Farimani, A. B. & Reddy, C. K. LLM-SR: scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400 (2024).

  • Zhang, S. et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI 2, AIoa2400640 (2025).

  • Mishra-Sharma, S., Song, Y. & Thaler, J. PAPERCLIP: associating astronomical observations and natural language with multi-modal models. In Proc. 1st Conf. Lang. Model. (2024).

  • Parker, L. et al. AstroCLIP: a cross-modal foundation model for galaxies. Mon. Not. R. Astron Soc. 531, 4990–5011 (2024).

  • OpenAI et al. GPT-4 Technical Report. (2023).

  • Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (2019).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. Proc. Mach. Learn Res. 139, 8748–8763 (2021).

  • Ramesh, A. et al. Zero-shot text-to-image generation. Proc. Mach. Learn Res. 139, 8821–8831 (2021).

  • DeepMind. AlphaProof: AI achieves silver-medal standard solving International Mathematical Olympiad problems. (2024).

  • Zelikman, E., Wu, Y., Mu, J. & Goodman, N. D. STaR: bootstrapping reasoning with reasoning. (2022).

  • Popper, K. The Logic of Scientific Discovery (1959).

  • Fan, W. et al. A Survey on RAG Meeting LLMs: towards retrieval-augmented large language models. In Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 6491–6501 (2024).

  • Li, Y., Xu, M., Miao, X., Zhou, S. & Qian, T. Prompting large language models for counterfactual generation: an empirical study. In Proc. 2024 Jt. Int. Conf. Comput. Linguist. Lang. Resour. Eval. (LREC-COLING) 13201–13221 (2024).

  • Kiciman, E., Ness, R., Sharma, A. & Tan, C. Causal reasoning and large language models: opening a new frontier for causality. Trans. Mach. Learn. Res. (2024).

  • Jiralerspong, T., Chen, X., More, Y., Shah, V. & Bengio, Y. Efficient causal graph discovery using large language models. arXiv preprint arXiv:2402.01207 (2024).

  • Schick, T. et al. Toolformer: language models can teach themselves to use tools. Adv. Neural Inf. Process Syst. 36 (2023).

  • Chen, Q., Ho, Y.-J. I., Sun, P. & Wang, D. The Philosopher’s Stone for Science: the catalyst change of AI for scientific creativity. SSRN preprint (2024).

  • Zenil, H. et al. An algorithmic information calculus for causal discovery and reprogramming systems. iScience 19, 1160–1172 (2019).

  • Pattee, H. H. & Rączaszek-Leonardi, J. Evolving self-reference: matter, symbols, and semantic closure. In Laws, Language and Life: Howard Pattee’s Classic Papers on the Physics of Symbols with Contemporary Commentary 211–226 (2012).

  • Singh, C., Morris, J. X., Aneja, J., Rush, A. M. & Gao, J. Explaining patterns in data with language models via interpretable autoprompting. arXiv preprint arXiv:2210.01848 (2022).

  • Li, P. et al. Table-GPT: table fine-tuned GPT for diverse table tasks. Proc. ACM Manag. Data 2, 176 (2024).

  • Zhan, J. et al. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 9637–9662 (2024).

  • Wu, S. et al. Beyond language models: byte models are digital world simulators. arXiv preprint arXiv:2402.19155 (2024).

  • Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).

  • Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

  • Zhang, S. et al. Intelligence at the edge of chaos. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • Lyu, C. et al. Large language models as code executors: an exploratory study. arXiv preprint arXiv:2410.06667 (2024).

  • Jiang, J. et al. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024).

  • Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. (2024).

  • Darvish, K. et al. ORGANA: a robotic assistant for automated chemistry experimentation and characterization. (2024).

  • Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with verbal reinforcement learning. Adv. Neural Inf. Process Syst. 36 (2023).

  • Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • Lavin, A. et al. Simulation intelligence: towards a new generation of scientific methods. arXiv preprint arXiv:2112.03235 (2021).

  • Park, J. S. et al. Generative agents: interactive simulacra of human behavior. UIST ’23: Proc. 36th Annual ACM Symposium on User Interface Software and Technology (2023).

  • Singh, C. et al. Explaining black box text modules in natural language with language models. arXiv preprint arXiv:2305.09863 (2023).

  • Bills, S. et al. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (2023).

  • Yin, Y., Wang, Y., Evans, J. A. & Wang, D. Quantifying the dynamics of failure across science, startups and security. Nature 575, 190–194 (2019).

  • Kauffman, S. A. Investigations. (Oxford University Press, 2000).

  • Burger, B. et al. A mobile robotic chemist. Nature 583, 237–241 (2020).

  • Yang, C. et al. Large language models as optimizers. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Wang, Q., Downey, D., Ji, H. & Hope, T. SciMON: scientific inspiration machines optimized for novelty. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 279–299 (2024).

  • Ellis, K. et al. DreamCoder: growing generalizable, interpretable knowledge with wake–sleep Bayesian program learning. Philos. Trans. R. Soc. A 381 (2023).

  • Hazan, H. & Levin, M. Exploring the behavior of bioelectric circuits using evolution heuristic search. Bioelectricity 4, 207–227 (2022).

  • Fleming, L. Recombinant uncertainty in technological search. Manag. Sci. 47, 117–132 (2001).

  • Weitzman, M. Optimal Search for the Best Alternative. vol. 78 (Department of Energy, 1978).

  • Van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R. & Bockting, C. L. ChatGPT: five priorities for research. Nature 614, 224–226 (2023).

  • Edwards, B. Why ChatGPT and Bing Chat are so good at making things up. Ars Technica (2023).

  • DeepMind. Shaking the foundations: delusions in sequence models for interaction and control. www.deepmind.com (2023).

  • Wang, R. et al. Hypothesis search: inductive reasoning with language models. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Lavin, A. et al. Technology readiness levels for machine learning systems. Nat. Commun. 13 (2022).

  • Zhou, J. P. et al. Don’t Trust: Verify – Grounding LLM Quantitative Reasoning with Autoformalization. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Olausson, T. X. et al. LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 5153–5176 (2023).

  • Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547 (2019).

  • Abdel-Rehim, A. et al. Scientific hypothesis generation by a large language model: laboratory validation in breast cancer treatment. J. R. Soc. Interface 22, 20240674 (2025).

  • Zhao, Y. et al. Assessing and understanding creativity in large language models. arXiv preprint arXiv:2401.12491 (2024).

  • Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation. bioRxiv 2024.11.11.623004 (2024).

  • Girotra, K., Meincke, L., Terwiesch, C. & Ulrich, K. T. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071 (2023).

  • Qi, B. et al. Large Language Models are Zero Shot Hypothesis Proposers. arXiv preprint arXiv:2311.05965 (2023).

  • Liu, T. J. B., Boulle, N., Sarfati, R. & Earls, C. LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law. In Proc. 2024 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 15097–15117 (2024).

  • Delétang, G. et al. Language Modeling Is Compression. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Huang, Y., Zhang, J., Shan, Z. & He, J. Compression represents intelligence linearly. (2024).

  • Shao, Y. et al. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 6252–6278 (2024).

  • OpenAI. Introducing deep research. (2025).

  • Perplexity AI. Introducing Perplexity Deep Research. (2025).

  • Cai, Z., Chang, B. & Han, W. Human-in-the-loop through chain-of-thought. arXiv preprint arXiv:2306.07932 (2023).

  • Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38 (2022).

  • Matsakis, L. Artificial Intelligence May Not ‘Hallucinate’ After All. Wired (2019).

  • Gilmer, J. & Hendrycks, D. A Discussion of ‘adversarial examples are not bugs, they are features’: adversarial example researchers need to expand what is meant by ‘robustness’. Distill 4, 00019.1 (2019).

  • Sahoo, P. et al. A comprehensive survey of hallucination in large language, image, video and audio foundation models. In Findings Assoc. Comput. Linguist.: EMNLP 2024 11709–11724 (2024).

  • Varshney, N. et al. A stitch in time saves nine: detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987 (2023).

  • Li, J., Zhang, Q., Yu, Y., Fu, Q. & Ye, D. More agents is all you need. Trans. Mach. Learn. Res. (2024).

  • Mündler, N., He, J., Jenko, S. & Vechev, M. Self-contradictory hallucinations of large language models: evaluation, detection and mitigation. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Dhuliawala, S. et al. Chain-of-Verification Reduces Hallucination in Large Language Models. In Findings Assoc. Comput. Linguist.: ACL 2024 3563–3578 (2024).

  • Gao, L. et al. RARR: Researching and Revising What Language Models Say, Using Language Models. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 16477–16508 (2023).

  • Zhu, C., Xu, B., Wang, Q., Zhang, Y. & Mao, Z. On the calibration of large language models and alignment. (2022).

  • Li, J., Cheng, X., Zhao, W. X., Nie, J. Y. & Wen, J. R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 6449–6464 (2023).

  • Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 3214–3252 (2022).

  • Min, S. et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 12076–12100 (2023).

  • Liu, Y. et al. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. (2023).

  • Li, M. et al. Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. (2024).

  • Ye, Q., Fu, H. Y., Ren, X. & Jia, R. How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench. In Findings Assoc. Comput. Linguist.: EMNLP 2023 7493–7517 (2023).

  • Berglund, L. et al. The reversal curse: LLMs trained on “A is B” fail to learn “B is A”. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Chen, X., Chi, R. A., Wang, X. & Zhou, D. Premise order matters in reasoning with large language models. Proc. Mach. Learn Res. 235, 6596–6620 (2024).

  • Allen-Zhu, Z. & Li, Y. Physics of language models: part 3.2, knowledge manipulation. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • Nezhurina, M., Cipolina-Kun, L., Cherti, M. & Jitsev, J. Alice in Wonderland: simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061 (2025).

  • Dziri, N. et al. Faith and Fate: Limits of Transformers on Compositionality. Adv. Neural Inf. Process Syst. 36 (2023).

  • Delétang, G. et al. Neural Networks and the Chomsky Hierarchy. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D. & Griffiths, T. L. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proc. Natl. Acad. Sci. USA 121, e2322420121 (2024).

  • Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. Proc. 2024 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol., NAACL 2024 1, 1819–1862 (2024).

  • Kambhampati, S. et al. LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks. (2024).

  • Pallagani, V. et al. On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS). Proc. Int. Conf. Automated Plan. Sched. 34, 432–444 (2024).

  • Huang, J. et al. Large Language Models Cannot Self-Correct Reasoning Yet. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • Zhou, K., Hwang, J., Ren, X. & Sap, M. Relying on the unreliable: the impact of language models’ reluctance to express uncertainty. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 3623–3643 (2024).

  • Du, Y., Li, S., Torralba, A., Tenenbaum, J. B. & Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Proc. Mach. Learn Res. 235, 11733–11763 (2024).

  • Golovneva, O., Allen-Zhu, Z., Weston, J. E. & Sukhbaatar, S. Reverse training to nurse the reversal curse. In Proc. 1st Conf. Lang. Model. (2024).

  • nostalgebraist. Interpreting GPT: the logit lens. LessWrong https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (2020).

  • Zou, A. et al. Representation engineering: a top-down approach to AI transparency. arXiv preprint arXiv:2310.01405 (2023).

  • Chen, Y. et al. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Proc. 41st Int. Conf. Mach. Learn. (ICML) 7880–7904 (2024).

  • Wiegreffe, S. & Pinter, Y. Attention is not not Explanation. In Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. (EMNLP-IJCNLP) 11–20 (2019).

  • Jain, S. & Wallace, B. C. Attention is not Explanation. In Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. (NAACL) 3543–3556 (2019).

  • Zhang, Y. et al. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief. Bioinform. 25, 1–22 (2023).

  • Narayanan, S. et al. Aviary: training language agents on challenging scientific tasks. (2024).

  • Taragin, M. I. Learning from negative findings. Isr. J. Health Policy Res. 8, 1–4 (2019).

  • Bik, E. M. Publishing negative results is good for science. Access Microbiol. 6, 000792 (2024).

  • Echevarría, L., Malerba, A. & Arechavala-Gomeza, V. Researcher’s Perceptions on Publishing “Negative” Results and Open Access. Nucleic Acid Ther. 31, 185 (2021).

  • Gray, A. ChatGPT ‘contamination’: estimating the prevalence of LLMs in the scholarly literature. (2024).

  • Liang, W. et al. Mapping the Increasing Use of LLMs in Scientific Papers. (2024).

  • Latona, G. R., Ribeiro, M. H., Davidson, T. R., Veselovsky, V. & West, R. The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates. (2024).

  • Liang, W. et al. Monitoring AI-modified content at scale: a case study on the impact of ChatGPT on AI conference peer reviews. In Proc. 41st Int. Conf. Mach. Learn. (ICML) 29575–29620 (2024).

  • Liang, W. et al. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. NEJM AI 1 (2024).

  • Thelwall, M. Can ChatGPT evaluate research quality? J. Data Inf. Sci. 9, 1–21 (2024).

  • Meyer, J. G. et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 16, 20 (2023).

  • Yanai, I. & Lercher, M. Night science. Genome Biol. 20, 1–3 (2019).

  • Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).

  • Zhao, A. et al. Absolute Zero: reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335 (2025).

  • Tegnér, J. N. et al. Computational disease modeling – fact or fiction? BMC Syst. Biol. 3, 1–3 (2009).

  • Petroni, F. et al. Language Models as Knowledge Bases? In Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. (EMNLP-IJCNLP) 2463–2473 (2019).

  • Lee, M. A mathematical investigation of hallucination and creativity in GPT models. Mathematics 11, 2320 (2023).

  • Chen, M. et al. Evaluating Large Language Models Trained on Code. (2021).

  • Peng, B. et al. Check your facts and try again: improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813 (2023).

  • Luo, L., Li, Y.-F., Haffari, G. & Pan, S. Reasoning on graphs: faithful and interpretable large language model reasoning. (2023).

  • Ji, Z. et al. Towards Mitigating LLM Hallucination via Self Reflection. In Findings Assoc. Comput. Linguist.: EMNLP 2023 1827–1843 (2023).

  • Yao, S. et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Adv. Neural Inf. Process Syst. 36 (2023).

  • Ye, X. & Durrett, G. The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning. Adv. Neural Inf. Process Syst. 35 (2022).

  • Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L. & Geva, M. Patchscopes: a unifying framework for inspecting hidden representations of language models. Proc. Mach. Learn Res. 235, 15466–15490 (2024).

  • Hyland, K. Academic publishing and the myth of linguistic injustice. J. Second Lang. Writ. 31, 58–69 (2016).

  • Clavero, M. “Awkward wording. Rephrase”: linguistic injustice in ecological journals. (2010).

  • Strauss, P. Shakespeare and the English poets: the influence of native speaking English reviewers on the acceptance of journal articles. Publications 7, 20 (2019).
