Exploring the role of large language models in the scientific method: from hypothesis to discovery

  • Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).

  • Alkhateeb, A. Can Scientific Discovery Be Automated? The Atlantic (2017).

  • Jain, M. et al. GFlowNets for AI-driven scientific discovery. Digit. Discov. 2, 557–577 (2023).

  • Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. NPJ Syst. Biol. Appl. 7, 29 (2021).

  • Cornelio, C. et al. Combining data and theory for derivable scientific discovery with AI-Descartes. Nat. Commun. 14, 1–10 (2023).

  • Kim, S. et al. Integration of neural network-based symbolic regression in deep learning for scientific discovery. IEEE Trans. Neural Netw. Learn Syst. 32, 4166–4177 (2021).

  • Gil, Y., Greaves, M., Hendler, J. & Hirsh, H. Amplify scientific discovery with artificial intelligence: many human activities are a bottleneck in progress. Science 346, 171–172 (2014).

  • Kitano, H. Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery. AI Mag. 37, 39–49 (2016).

  • Berens, P., Cranmer, K., Lawrence, N. D., von Luxburg, U. & Montgomery, J. AI for science: an emerging agenda. arXiv preprint arXiv:2303.04217 (2023).

  • Li, Z., Ji, J. & Zhang, Y. From Kepler to Newton: explainable AI for science. arXiv preprint arXiv:2111.12210 (2021).

  • Baker, N. et al. Workshop report on basic research needs for scientific machine learning: core technologies for artificial intelligence. (2019).

  • Manta, C. D., Hu, E. & Bengio, Y. GFlowNets for causal discovery: an overview. OpenReview (ICML, 2023).

  • Vinuesa, R., Brunton, S. L. & McKeon, B. J. The transformative potential of machine learning for experiments in fluid mechanics. Nat. Rev. Phys. 5, 536–545 (2023).

  • del Rosario, Z. & del Rosario, M. Synthesizing domain science with machine learning. Nat. Comput. Sci. 2, 779–780 (2022).

  • Krenn, M., Landgraf, J., Foesel, T. & Marquardt, F. Artificial intelligence and machine learning for quantum technologies. Phys. Rev. A 107, 010101 (2023).

  • van der Schaar, M. et al. How artificial intelligence and machine learning can help healthcare systems respond to COVID-19. Mach. Learn. 110, 1–14 (2021).

  • Zhang, T. et al. AI for global climate cooperation: modeling global climate negotiations, agreements, and long-term cooperation in RICE-N. arXiv preprint arXiv:2208.07004 (2022).

  • Pion-Tonachini, L. et al. Learning from learning machines: a new generation of AI technology to meet the needs of science. (2021).

  • Keith, J. A. et al. Combining machine learning and computational chemistry for predictive insights into chemical systems. Chem. Rev. 121, 9816–9872 (2021).

  • Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 5, 277–280 (2023).

  • Georgescu, I. How machines could teach physicists new scientific concepts. Nat. Rev. Phys. 4, 736–738 (2022).

  • Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annu. Rev. Phys. Chem. (2020).

  • Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  • Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

  • Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. (2022).

  • Topol, E. J. Welcoming new guidelines for AI clinical research. Nat. Med. 26, 1318–1320 (2020).

  • Topol, E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. (Hachette UK, 2019).

  • Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. (2022).

  • Stoyanovich, J., Van Bavel, J. J. & West, T. V. The imperative of interpretable machines. Nat. Mach. Intell. 2, 197–199 (2020).

  • Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit. Med. 6, 120 (2023).

  • Willcox, K. E., Ghattas, O. & Heimbach, P. The imperative of physics-based modeling and inverse theory in computational science. Nat. Comput. Sci. 1, 166–168 (2021).

  • Webster, P. Six ways large language models are changing healthcare. Nat. Med. 29, 2969–2971 (2023).

  • Eriksen, A. V., Möller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 1 (2023).

  • Ishmam, M. F., Shovon, M. S. H. & Dey, N. From image to language: a critical analysis of visual question answering (VQA) approaches, challenges, and opportunities. Inf. Fusion 106, 102270 (2024).

  • Li, C. et al. Multimodal foundation models: from specialists to general-purpose assistants. Found. Trends Comput. Graph. Vis. 16, 1–214 (2024).

  • Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024).

  • Raghu, M. & Schmidt, E. A survey of deep learning for scientific discovery. arXiv preprint arXiv:2003.11755 (2020).

  • Song, S. et al. DeepSpeed4Science initiative: enabling large-scale scientific discovery through sophisticated AI system technologies. In NeurIPS 2023 AI for Science Workshop (2023).

  • McCarthy, J., Minsky, M. L., Rochester, N. & Shannon, C. E. A proposal for the Dartmouth summer research project on artificial intelligence (1956).

  • Ramos, M. C., Collison, C. J. & White, A. D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 16, 2514–2572 (2025).

  • Zenil, H. & King, R. Artificial intelligence in scientific discovery: Challenges and opportunities. In Science: Challenges, Opportunities and the Future of Research (OECD Publishing, 2023).

  • Zenil, H. & King, R. A framework for evaluating the AI-driven automation of science. In Science: Challenges, Opportunities and the Future of Research (OECD Publishing, 2023).

  • Zenil, H. & King, R. The Far Future of AI in Scientific Discovery. In AI For Science (eds Choudhary, A., Fox, G. & Hey, T.) (World Scientific, 2023).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process Syst. 35 (2022).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process Syst. 33, 9459–9474 (2020).

  • White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023).

  • Schulhoff, S. et al. The prompt report: a systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608 (2025).

  • Fulford, I. & Ng, A. ChatGPT prompt engineering for developers. DeepLearning.AI (2023).

  • Cheng, Y. et al. Exploring large language model based intelligent agents: definitions, methods, and prospects. arXiv preprint arXiv:2401.03428 (2024).

  • Yang, H., Yue, S. & He, Y. Auto-GPT for online decision making: benchmarks and additional opinions. arXiv preprint arXiv:2306.02224 (2023).

  • Nakajima, Y. yoheinakajima/babyagi. GitHub https://github.com/yoheinakajima/babyagi (2024).

  • Chase, H. LangChain. (2022).

  • Liu, J. LlamaIndex (2022).

  • Khattab, O. et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Yuksekgonul, M. et al. TextGrad: automatic ‘differentiation’ via text. (2024).

  • DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. (2025).

  • Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. (2024).

  • Li, Z.-Z. et al. From system 1 to system 2: a survey of reasoning large language models. (2025).

  • Caufield, J. H. et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, btae104 (2024).

  • Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 120, e2305016120 (2023).

  • Liu, Y. et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 2511–2522 (2023).

  • Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 3, 1–23 (2021).

  • Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn Sci. Technol. 3, 015022 (2022).

  • Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

  • Gottweis, J. et al. Towards an AI co-scientist. (2025).

  • Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  • M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

  • Qu, Y. et al. CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments. bioRxiv (2024).

  • Roohani, Y. H. et al. BioDiscoveryAgent: an AI agent for designing genetic perturbation experiments. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • OpenAI et al. OpenAI o1 System Card (2024).

  • OpenAI. Learning to Reason with LLMs. (2024).

  • Kaplan, J. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).

  • Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. (2022).

  • Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 386, eado9336 (2024).

  • Brixi, G. et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv (2025).

  • Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).

  • Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. (2020).

  • Birk, J., Hallin, A. & Kasieczka, G. OmniJet-α: the first cross-task foundation model for particle physics. Mach. Learn Sci. Technol. 5, 035031 (2024).

  • McCabe, M. et al. Multiple physics pretraining for spatiotemporal surrogate models. In Proc. 38th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS) (2024).

  • Shojaee, P., Meidani, K., Gupta, S., Farimani, A. B. & Reddy, C. K. LLM-SR: scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400 (2024).

  • Zhang, S. et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI 2, AIoa2400640 (2025).

  • Mishra-Sharma, S., Song, Y. & Thaler, J. PAPERCLIP: associating astronomical observations and natural language with multi-modal models. In Proc. 1st Conf. Lang. Model. (2024).

  • Parker, L. et al. AstroCLIP: a cross-modal foundation model for galaxies. Mon. Not. R. Astron Soc. 531, 4990–5011 (2024).

  • OpenAI et al. GPT-4 Technical Report. (2023).

  • Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (2019).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. Proc. Mach. Learn Res. 139, 8748–8763 (2021).

  • Ramesh, A. et al. Zero-shot text-to-image generation. Proc. Mach. Learn Res. 139, 8821–8831 (2021).

  • DeepMind. AlphaProof: AI achieves silver-medal standard solving International Mathematical Olympiad problems. (2024).

  • Zelikman, E., Wu, Y., Mu, J. & Goodman, N. D. STaR: bootstrapping reasoning with reasoning. (2022).

  • Popper, K. The Logic of Scientific Discovery (1959).

  • Fan, W. et al. A Survey on RAG Meeting LLMs: towards retrieval-augmented large language models. In Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 6491–6501 (2024).

  • Li, Y., Xu, M., Miao, X., Zhou, S. & Qian, T. Prompting large language models for counterfactual generation: an empirical study. In Proc. 2024 Jt. Int. Conf. Comput. Linguist. Lang. Resour. Eval. (LREC-COLING) 13201–13221 (2024).

  • Kiciman, E., Ness, R., Sharma, A. & Tan, C. Causal reasoning and large language models: opening a new frontier for causality. Trans. Mach. Learn. Res. (2024).

  • Jiralerspong, T., Chen, X., More, Y., Shah, V. & Bengio, Y. Efficient causal graph discovery using large language models. arXiv preprint arXiv:2402.01207 (2024).

  • Schick, T. et al. Toolformer: language models can teach themselves to use tools. Adv. Neural Inf. Process Syst. 36 (2023).

  • Chen, Q., Ho, Y.-J. I., Sun, P. & Wang, D. The Philosopher’s Stone for Science: the catalyst change of AI for scientific creativity. SSRN preprint (2024).

  • Zenil, H. et al. An algorithmic information calculus for causal discovery and reprogramming systems. iScience 19, 1160–1172 (2019).

  • Pattee, H. H. & Rączaszek-Leonardi, J. Evolving self-reference: matter, symbols, and semantic closure. In Laws, Language and Life: Howard Pattee’s Classic Papers on the Physics of Symbols with Contemporary Commentary 211–226 (2012).

  • Singh, C., Morris, J. X., Aneja, J., Rush, A. M. & Gao, J. Explaining patterns in data with language models via interpretable autoprompting. arXiv preprint arXiv:2210.01848 (2022).

  • Li, P. et al. Table-GPT: table fine-tuned GPT for diverse table tasks. Proc. ACM Manag. Data 2, 176 (2024).

  • Zhan, J. et al. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 9637–9662 (2024).

  • Wu, S. et al. Beyond language models: byte models are digital world simulators. arXiv preprint arXiv:2402.19155 (2024).

  • Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).

  • Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

  • Zhang, S. et al. Intelligence at the edge of chaos. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • Lyu, C. et al. Large language models as code executors: an exploratory study. arXiv preprint arXiv:2410.06667 (2024).

  • Jiang, J. et al. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024).

  • Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. (2024).

  • Darvish, K. et al. ORGANA: a robotic assistant for automated chemistry experimentation and characterization. (2024).

  • Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with verbal reinforcement learning. Adv. Neural Inf. Process Syst. 36 (2023).

  • Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • Lavin, A. et al. Simulation intelligence: towards a new generation of scientific methods. arXiv preprint arXiv:2112.03235 (2021).

  • Park, J. S. et al. Generative agents: interactive simulacra of human behavior. UIST ’23: Proc. 36th Annual ACM Symposium on User Interface Software and Technology (2023).

  • Singh, C. et al. Explaining black box text modules in natural language with language models. arXiv preprint arXiv:2305.09863 (2023).

  • Bills, S. et al. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (2023).

  • Yin, Y., Wang, Y., Evans, J. A. & Wang, D. Quantifying the dynamics of failure across science, startups and security. Nature 575, 190–194 (2019).

  • Kauffman, S. A. Investigations. (Oxford University Press, 2000).

  • Burger, B. et al. A mobile robotic chemist. Nature 583, 237–241 (2020).

  • Yang, C. et al. Large language models as optimizers. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Wang, Q., Downey, D., Ji, H. & Hope, T. SciMON: scientific inspiration machines optimized for novelty. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 279–299 (2024).

  • Ellis, K. et al. DreamCoder: growing generalizable, interpretable knowledge with wake–sleep Bayesian program learning. Philos. Trans. R. Soc. A 381 (2023).

  • Hazan, H. & Levin, M. Exploring the behavior of bioelectric circuits using evolution heuristic search. Bioelectricity 4, 207–227 (2022).

  • Fleming, L. Recombinant uncertainty in technological search. Manag. Sci. 47, 117–132 (2001).

  • Weitzman, M. Optimal Search for the Best Alternative. vol. 78 (Department of Energy, 1978).

  • Van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R. & Bockting, C. L. ChatGPT: five priorities for research. Nature 614, 224–226 (2023).

  • Edwards, B. Why ChatGPT and Bing Chat are so good at making things up. Ars Technica (2023).

  • DeepMind. Shaking the foundations: delusions in sequence models for interaction and control. www.deepmind.com (2023).

  • Wang, R. et al. Hypothesis search: inductive reasoning with language models. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Lavin, A. et al. Technology readiness levels for machine learning systems. Nat. Commun. 13 (2022).

  • Zhou, J. P. et al. Don’t Trust: Verify – Grounding LLM Quantitative Reasoning with Autoformalization. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Olausson, T. X. et al. LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 5153–5176 (2023).

  • Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547 (2019).

  • Abdel-Rehim, A. et al. Scientific hypothesis generation by a large language model: laboratory validation in breast cancer treatment. J. R. Soc. Interface 22, 20240674 (2025).

  • Zhao, Y. et al. Assessing and understanding creativity in large language models. arXiv preprint arXiv:2401.12491 (2024).

  • Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation. bioRxiv 2024.11.11.623004 (2024).

  • Girotra, K., Meincke, L., Terwiesch, C. & Ulrich, K. T. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071 (2023).

  • Qi, B. et al. Large Language Models are Zero Shot Hypothesis Proposers. arXiv preprint arXiv:2311.05965 (2023).

  • Liu, T. J. B., Boulle, N., Sarfati, R. & Earls, C. LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law. In Proc. 2024 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 15097–15117 (2024).

  • Delétang, G. et al. Language Modeling Is Compression. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Huang, Y., Zhang, J., Shan, Z. & He, J. Compression represents intelligence linearly. (2024).

  • Shao, Y. et al. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 6252–6278 (2024).

  • OpenAI. Introducing deep research. (2025).

  • Perplexity AI. Introducing Perplexity Deep Research. (2025).

  • Cai, Z., Chang, B. & Han, W. Human-in-the-loop through chain-of-thought. arXiv preprint arXiv:2306.07932 (2023).

  • Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38 (2022).

  • Matsakis, L. Artificial Intelligence May Not ‘Hallucinate’ After All. Wired (2019).

  • Gilmer, J. & Hendrycks, D. A Discussion of ‘adversarial examples are not bugs, they are features’: adversarial example researchers need to expand what is meant by ‘robustness’. Distill 4, 00019.1 (2019).

  • Sahoo, P. et al. A comprehensive survey of hallucination in large language, image, video and audio foundation models. In Findings Assoc. Comput. Linguist.: EMNLP 2024 11709–11724 (2024).

  • Varshney, N. et al. A stitch in time saves nine: detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987 (2023).

  • Li, J., Zhang, Q., Yu, Y., Fu, Q. & Ye, D. More agents is all you need. Trans. Mach. Learn. Res. (2024).

  • Mündler, N., He, J., Jenko, S. & Vechev, M. Self-contradictory hallucinations of large language models: evaluation, detection and mitigation. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Dhuliawala, S. et al. Chain-of-Verification Reduces Hallucination in Large Language Models. In Findings Assoc. Comput. Linguist.: ACL 2024 3563–3578 (2024).

  • Gao, L. et al. RARR: Researching and Revising What Language Models Say, Using Language Models. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 16477–16508 (2023).

  • Zhu, C., Xu, B., Wang, Q., Zhang, Y. & Mao, Z. On the calibration of large language models and alignment. (2022).

  • Li, J., Cheng, X., Zhao, W. X., Nie, J. Y. & Wen, J. R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 6449–6464 (2023).

  • Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 3214–3252 (2022).

  • Min, S. et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (EMNLP) 12076–12100 (2023).

  • Liu, Y. et al. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. (2023).

  • Li, M. et al. Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. (2024).

  • Ye, Q., Fu, H. Y., Ren, X. & Jia, R. How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench. In Findings Assoc. Comput. Linguist.: EMNLP 2023 7493–7517 (2023).

  • Berglund, L. et al. The reversal curse: LLMs trained on “A is B” fail to learn “B is A”. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Chen, X., Chi, R. A., Wang, X. & Zhou, D. Premise order matters in reasoning with large language models. Proc. Mach. Learn Res. 235, 6596–6620 (2024).

  • Allen-Zhu, Z. & Li, Y. Physics of language models: part 3.2, knowledge manipulation. In Proc. 13th Int. Conf. Learn. Represent. (ICLR) (2025).

  • Nezhurina, M., Cipolina-Kun, L., Cherti, M. & Jitsev, J. Alice in Wonderland: simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061 (2025).

  • Dziri, N. et al. Faith and Fate: Limits of Transformers on Compositionality. Adv. Neural Inf. Process Syst. 36 (2023).

  • Delétang, G. et al. Neural Networks and the Chomsky Hierarchy. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D. & Griffiths, T. L. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proc. Natl. Acad. Sci. USA 121, e2322420121 (2024).

  • Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. Proc. 2024 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol., NAACL 2024 1, 1819–1862 (2024).

  • Kambhampati, S. et al. LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks. (2024).

  • Pallagani, V. et al. On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS). Proc. Int. Conf. Automated Plan. Sched. 34, 432–444 (2024).

  • Huang, J. et al. Large Language Models Cannot Self-Correct Reasoning Yet. In Proc. 12th Int. Conf. Learn. Represent. (ICLR) (2024).

  • Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proc. 11th Int. Conf. Learn. Represent. (ICLR) (2023).

  • Zhou, K., Hwang, J., Ren, X. & Sap, M. Relying on the unreliable: the impact of language models’ reluctance to express uncertainty. In Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. (ACL) 3623–3643 (2024).

  • Du, Y., Li, S., Torralba, A., Tenenbaum, J. B. & Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Proc. Mach. Learn Res. 235, 11733–11763 (2024).

  • Golovneva, O., Allen-Zhu, Z., Weston, J. E. & Sukhbaatar, S. Reverse training to nurse the reversal curse. In Proc. 1st Conf. Lang. Model. (2024).

  • nostalgebraist. Interpreting GPT: the logit lens. LessWrong https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (2020).

  • Zou, A. et al. Representation engineering: a top-down approach to AI transparency. arXiv preprint arXiv:2310.01405 (2023).

  • Chen, Y. et al. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Proc. 41st Int. Conf. Mach. Learn. (ICML) 7880–7904 (2024).

  • Wiegreffe, S. & Pinter, Y. Attention is not not Explanation. In Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. (EMNLP-IJCNLP) 11–20 (2019).

  • Jain, S. & Wallace, B. C. Attention is not Explanation. In Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. (NAACL) 3543–3556 (2019).

  • Zhang, Y. et al. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief. Bioinform. 25, 1–22 (2023).

  • Narayanan, S. et al. Aviary: training language agents on challenging scientific tasks. (2024).

  • Taragin, M. I. Learning from negative findings. Isr. J. Health Policy Res. 8, 1–4 (2019).

  • Bik, E. M. Publishing negative results is good for science. Access Microbiol. 6, 000792 (2024).

  • Echevarría, L., Malerba, A. & Arechavala-Gomeza, V. Researcher’s Perceptions on Publishing “Negative” Results and Open Access. Nucleic Acid Ther. 31, 185 (2021).

  • Gray, A. ChatGPT ‘contamination’: estimating the prevalence of LLMs in the scholarly literature. (2024).

  • Liang, W. et al. Mapping the Increasing Use of LLMs in Scientific Papers. (2024).

  • Latona, G. R., Ribeiro, M. H., Davidson, T. R., Veselovsky, V. & West, R. The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates. (2024).

  • Liang, W. et al. Monitoring AI-modified content at scale: a case study on the impact of ChatGPT on AI conference peer reviews. In Proc. 41st Int. Conf. Mach. Learn. (ICML) 29575–29620 (2024).

  • Liang, W. et al. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. NEJM AI 1 (2024).

  • Thelwall, M. Can ChatGPT evaluate research quality? J. Data Inf. Sci. 9, 1–21 (2024).

  • Meyer, J. G. et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 16, 20 (2023).

  • Yanai, I. & Lercher, M. Night science. Genome Biol. 20, 1–3 (2019).

  • Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).

  • Zhao, A. et al. Absolute Zero: reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335 (2025).

  • Tegnér, J. N. et al. Computational disease modeling – fact or fiction? BMC Syst. Biol. 3, 1–3 (2009).

  • Petroni, F. et al. Language Models as Knowledge Bases? In Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. (EMNLP-IJCNLP) 2463–2473 (2019).

  • Lee, M. A mathematical investigation of hallucination and creativity in GPT models. Mathematics 11, 2320 (2023).

  • Chen, M. et al. Evaluating Large Language Models Trained on Code. (2021).

  • Peng, B. et al. Check your facts and try again: improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813 (2023).

  • Luo, L., Li, Y.-F., Haffari, G. & Pan, S. Reasoning on graphs: faithful and interpretable large language model reasoning. (2023).

  • Ji, Z. et al. Towards Mitigating LLM Hallucination via Self Reflection. In Findings Assoc. Comput. Linguist.: EMNLP 2023 1827–1843 (2023).

  • Yao, S. et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Adv. Neural Inf. Process Syst. 36 (2023).

  • Ye, X. & Durrett, G. The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning. Adv. Neural Inf. Process Syst. 35 (2022).

  • Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L. & Geva, M. Patchscopes: a unifying framework for inspecting hidden representations of language models. Proc. Mach. Learn Res. 235, 15466–15490 (2024).

  • Hyland, K. Academic publishing and the myth of linguistic injustice. J. Second Lang. Writ. 31, 58–69 (2016).

  • Clavero, M. “Awkward wording. Rephrase”: linguistic injustice in ecological journals. (2010).

  • Strauss, P. Shakespeare and the English poets: the influence of native speaking English reviewers on the acceptance of journal articles. Publications 7, 20 (2019).
