Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting (2025)

Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong,
Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan,
Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li
The Chinese University of Hong Kong, The University of Hong Kong, Shanghai AI Lab
{jiangjy, jmwang, dche, 1155173761, liang.hong, lczong, 1155183728, 1155191449, fanyimin}@link.cuhk.edu.hk, {cpa2001, u3009618}@connect.hku.hk, {charliegood2019, sxysxygm}@gmail.com, sunjiawei1@pjlab.org.cn, cwu@cs.hku.hk, liyu@cse.cuhk.edu.hk
Equal Contribution

Abstract

Large language models (LLMs) have become important tools for solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks can hardly evaluate the performance of these models across diverse tasks effectively. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, under 0-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we introduce BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show which biological tasks are suitable for current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.



1 Introduction

Computational methods have become essential for solving many biological problems, such as protein folding (Jumper et al., 2021), function annotation (Singh, 2024), and designing new biomolecules (Lisanza et al., 2024). With the rise of language models in natural language processing (NLP) (Brown, 2020; Jiang et al., 2023), especially LLMs (Touvron et al., 2023; Bubeck et al., 2023; Guo et al., 2025), these powerful tools can now be applied to biology as well, especially models with over 1 billion parameters, such as ProLLaMA (Lv et al., 2024), Evo (Nguyen et al., 2024; Brixi et al., 2025), and Med-PaLM 2 (Singhal et al., 2023). In addition to handling biomedical text data such as electronic health record (EHR) and traditional Chinese medicine (TCM) question-answering (QA) systems, LLMs are also effective in analyzing biological targets like proteins and nucleic acids due to their similarity to natural language sequences. Utilizing LLMs in biological tasks has resulted in significant advancements over traditional methods, including improved accuracy, better generalization, and enhanced learning from fewer samples (Zhang et al., 2024b).

[Figure 1]

A key strategy for leveraging LLMs in biology is prompting, which involves designing specific input formats to guide the model’s output (Liu et al., 2023). Prompting offers several benefits: it requires less computational power and data compared to fine-tuning, enables rapid adaptation to various tasks, and fully utilizes the model’s existing knowledge. Additionally, prompting facilitates the resolution of complex biological questions by framing them in a way that LLMs can effectively interpret and answer.

Many specialized LLMs have been developed for different biological tasks (Wang et al., 2024a; Lv et al., 2024; Bolton et al., 2024), but there is currently no thorough investigation of which biological task is better suited for which LLM architecture. Benchmarking is a rigorous method for assessing the compatibility and performance of an LLM architecture across a variety of tasks (McIntosh et al., 2024; Jiang et al., 2024; Ali et al., 2024). However, current benchmark methods for LLMs on biological tasks face several challenges. First, most benchmarks designed for small-scale models are inadequate for evaluating LLMs due to their expanded hypothesis space. Second, existing validation datasets often contain overlapping data, highlighting the necessity for high-quality, clean benchmark sets. Third, despite the growing interest in developing LLMs for diverse tasks, there are limited benchmarks tailored to evaluate these comprehensive models.

Inspired by these needs, this paper introduces a comprehensive prompting-based benchmarking scheme for evaluating the performance of LLMs on various biological tasks. We present evaluation methods for LLMs using either biological sequence data or biomedicine-related text inputs, and apply these benchmarks to currently available LLMs, with an emphasis on general LLMs such as GPT-4o. Our assessment specifically focuses on benchmarking vanilla LLMs without fine-tuning under zero-shot or few-shot CoT (Wei et al., 2022) conditions to test the intrinsic abilities of LLMs, which can also better inform their performance after fine-tuning. Moreover, as LLMs often embed the key answer within the generated text output and there are currently no effective methods for extracting biological sequences and texts (Gu et al., 2022), we propose a novel answer extraction method, BioFinder, to retrieve key answers from responses and enhance the reliability of our LLM benchmark. Based on our comprehensive benchmark results, we analyze and summarize the biological tasks that are suitable for current LLMs to solve and provide architectural recommendations to guide the development of future LLMs for diverse biological tasks.

Our main contributions are as follows: (1) we propose a bioinformatics benchmark (Bio-benchmark) including 30 important tasks related to protein, RNA, RNA-binding protein (RBP), drug, electronic health record, medical QA, and traditional Chinese medicine. We further test these tasks in zero-shot and few-shot (with CoT) settings across six mainstream vanilla LLMs without fine-tuning: GPT-4o (OpenAI, 2024), Qwen-2.5-72b (Hui et al., 2024), Llama-3.1-70b (Dubey et al., 2024), Mistral-large-2 (Mistral-AI, 2024), Yi-1.5-34b (0.1AI, 2024), and InternLM-2.5-20b (Cai et al., 2024); (2) we propose an answer extraction tool (BioFinder) for LLMs on bioinformatics tasks to accurately extract answers from LLM responses, which exceeds the accuracy of existing methods by over 40%; (3) we evaluate and analyze the answers extracted by xFinder (Yu et al., 2024) and BioFinder on different tasks, identify which benchmarked tasks are well-suited for current state-of-the-art LLMs, and propose a prompt-based approach to enhance LLM performance on tasks that are currently less effectively addressed.

2 Benchmark construction

Protein.

Datasets for predicting protein secondary structure and function are developed using data from the Protein Data Bank (PDB), with deduplication and length restrictions applied to manage computational load. The protein secondary structure dataset is diversified through random sampling based on sequence lengths. For species prediction, sampling is guided by sequence distributions across species and family sequence counts, ensuring a balanced representation of families. These preprocessing steps are critical for building robust models that accurately predict protein function or species, advancing research in this field.

We include four subtasks in the protein section: protein family sequence design, protein species prediction, protein inverse-folding design, and protein structure prediction. For instance, in the protein inverse-folding design task, given the protein secondary structure "CCSHHHHHHHHHHHHHHHHHHHHHHCCCCC", we prompt the language model to design a protein sequence whose secondary structure matches the given one. The protein structure prediction task asks the model to predict the secondary structure of a given protein sequence. For protein family sequence design, we aim to generate protein sequences belonging to a specific protein family based on the protein family ID, such as "PF01713". For protein species prediction, we prompt the language model to predict the category of each protein based on its sequence and the predefined protein species categories.

RNA.

The dataset construction for RNA secondary structure prediction, inverse folding, and functional prediction involves careful preprocessing across multiple sources. The bpRNA dataset provides the foundation for the RNA structure and folding benchmarks: sequences undergo deduplication, removal of entries exceeding 1,024 nucleotides, and random sampling based on sequence length distributions to ensure diversity and representativeness.

For RNA functional prediction, data from RNAcentral are used, with samples selected based on the distribution of sequences across different functions and families. This approach ensures a balanced representation of various RNA families and functions, facilitating the development of robust models for accurately predicting RNA characteristics and behaviors in molecular biology research.

For the RNA section, we include five subtasks: RNA family sequence design, RNA function prediction, RNA inverse-folding design, RNA structure prediction, and sgRNA efficiency prediction. The first four tasks mirror those in the protein section, while the last involves predicting the efficiency of a given sgRNA, such as "AAAAAAAAUUGGAGGCCUUUCCCCUGGGCA", with an example efficiency prediction of 90%.

RNA-binding protein.

To rigorously define a benchmark set for RNA-protein interaction-related tasks, we first obtain a set of RNA sequences experimentally verified to interact with various RNA-binding proteins. We then categorize these proteins by their binding domains, such as RNA recognition motif (RRM) and zinc finger. For each sampled protein within a binding domain, we curate a balanced set of positive and negative RNA sequences for that protein. Notably, the same RNA-binding protein can behave differently across species, so species information is also included in the prompts. The classification benchmark is then constructed by prompting the model with the protein name, species, and RNA sequence.

This section covers the RNA-binding protein prediction task. Given a specific RNA sequence such as "AGAAGGUGUGAGAUUAAUGGAUGGGGUAGCUGACG", we prompt the language model to decide whether it binds to a specific protein such as "Pp_0237".
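As a minimal sketch of how such a binary classification prompt could be assembled, the snippet below builds a query from the protein name, species, and RNA sequence. The wording, field layout, and the species value are illustrative assumptions, not the exact Bio-benchmark templates.

```python
# Illustrative sketch of an RBP binding-classification prompt.
# The wording and field layout are assumptions for demonstration only;
# the actual Bio-benchmark prompt templates are given in the paper's appendix.

def build_rbp_prompt(protein_name: str, species: str, rna_seq: str) -> str:
    return (
        "You are given an RNA-binding protein and an RNA sequence.\n"
        f"Protein: {protein_name}\n"
        f"Species: {species}\n"
        f"RNA sequence: {rna_seq}\n"
        "Question: Does this RNA sequence bind to the protein above? "
        "Answer with 'Yes' or 'No'."
    )

print(build_rbp_prompt(
    protein_name="Pp_0237",
    species="Pseudomonas putida",  # assumed species, for illustration only
    rna_seq="AGAAGGUGUGAGAUUAAUGGAUGGGGUAGCUGACG",
))
```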

Drug.

In constructing a comprehensive drug benchmark for machine learning-guided drug design, three essential tasks are addressed: antibiotic design, drug-drug interactions, and drug-target interactions. These tasks are critical as they encompass the full spectrum of drug development, from discovery through to clinical application, and directly impact the success rate, efficiency, and safety of pharmaceuticals. For antibiotic design, the data includes novel antibiotic structures to address the urgent need for treatments against antimicrobial resistance and to test the predictive capabilities of LLMs. In the area of drug-drug interactions, the dataset is carefully curated to include 86 different interaction types, ensuring a broad representation of potential clinical scenarios to enhance medication safety and efficacy. For drug-target interactions, the selected data ensures diversity and specificity with non-overlapping drug structures and targets with low sequence similarity, streamlining the pathway from experimental screening to clinical application.

We focus on three essential tasks: drug-drug interaction prediction, drug-target interaction prediction, and drug design. For example, in the drug design task, we prompt the model to determine whether a molecule such as "Cc1c(Br)c(=O)c(C(=O)N2CC(C)S(=O)C(C)C2)cn1C1CC1" has potent efficacy against A. baumannii (a bacterium). The expected answer is "Not potent".

Electronic health record.

EHR data is scarce, particularly within the MIMIC database (Johnson et al., 2023), which predominantly contains data from ICUs and emergency departments, and is encumbered by privacy-related access restrictions. To address this, three diverse benchmarks have been selected for utilizing EHR data effectively: predicting diagnostic outcomes using patient information and test results, analyzing patient data incrementally to devise treatment plans, and generating medical reports from doctor-patient dialogues. These benchmarks are essential for improving health outcomes and the efficiency of medical services.

General medical question-answering.

The selected data for medical knowledge benchmarks emphasize accuracy and are derived from expert-verified exam questions. These benchmarks are geographically diverse: MedQA is drawn from exams in Mainland China, Taiwan, and the USA; MedMCQA from Indian exams; and HeadQA from Spanish exams. They cover a wide range of topics, with MMCU including 13 subcategories such as medical imaging and infectious diseases, while HeadQA spans six areas such as medicine and psychology. Some datasets simulate real-world medical consultations requiring long-text responses to open-ended questions, although most benchmarks use multiple-choice formats due to limited alternatives, reflecting a compromise in developing practical, 0-shot style medical question answering.

Question-answering in traditional Chinese medicine.

High-quality benchmarks for TCM are rare, primarily consisting of multiple-choice formats that often mix Eastern and Western medical concepts, making it difficult to isolate pure TCM content. The CMB/CMMLU datasets fill this gap by incorporating questions from ancient texts and traditional TCM, expressed in Classical Chinese, which tests models’ understanding of historical language nuances. Furthermore, the TCMSD dataset uses real clinical TCM cases, where questions focus on predicting diseases and syndromes from patient information using specific TCM terminologies, ensuring a practical evaluation of models’ clinical capabilities in TCM.

[Figure 2]

3 Evaluation methods

3.1 Problem definition

Given a set of questions $Q=\{q_i \mid q_i \in \Sigma^*\}$, our goal is to evaluate the performance of LLMs on various tasks. Since we adopt an unconstrained output format using the CoT method, accurately extracting the standard answer $a_i$ from the generated text $y_i=\text{LLM}(q_i)$ is challenging; traditional RegEx-based methods struggle to balance false positives and negatives. To address this, we utilize the BioFinder framework (Section 4.1) with an answer extraction function $E:\Sigma^*\rightarrow\Sigma^*$ to extract the key answer $k_i=E(y_i)$ from the model output $y_i$.

We divide the evaluation tasks into objective and subjective evaluations. For objective evaluation tasks ($t_i \in T_{\text{obj}}$), such as multiple-choice questions and character matching with a definitive standard answer $a_i$, we define an evaluation function $M_{\text{obj}}:\Sigma^*\times\Sigma^*\times T_{\text{obj}}\rightarrow\mathbb{R}$ to compare the extracted answer $k_i$ with the standard answer $a_i$ and compute the performance of the LLM on task $t_i$:

$$\text{Performance}_{\text{obj}} = M_{\text{obj}}(k_i, a_i, t_i). \tag{1}$$

For subjective evaluation tasks ($t_i \in T_{\text{subj}}$), such as long-text generation, open-ended question answering, and case analysis, we evaluate the quality of the model’s output through the following:

  • Similarity calculation: Define a similarity function $S:\Sigma^*\times\Sigma^*\rightarrow\mathbb{R}$ to compute the textual and semantic similarity between the generated answer $y_i$ and the reference answer $a_i$:

    $$s_i = S(y_i, a_i). \tag{2}$$
  • Expertise assessment: Use advanced LLMs like GPT-4o to assess the professional quality of the generated content $y_i$, obtaining a quality score $q_i$.

  • Logical consistency judgment: Use the trained BioFinder to perform Natural Language Inference (NLI), defining a logical relation function

    $$N:\Sigma^*\times\Sigma^* \rightarrow \{\text{Entailment},\text{Contradiction},\text{Neutral}\} \tag{3}$$

    to conduct a fine-grained analysis of the logical relationship between $y_i$ and $a_i$:

    $$n_i = N(y_i, a_i). \tag{4}$$

Through the BioFinder framework, we can efficiently and accurately extract key answers from LLM-generated texts without constraining the output format, enabling a comprehensive evaluation of the model’s capabilities and generation quality.
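To make the objective branch concrete, the sketch below assumes $M_{\text{obj}}$ reduces to exact matching for multiple-choice, text, and numeric answers, and to per-position identity ("recovery") for sequence answers; the actual evaluation functions are task-specific, so these definitions are assumptions for illustration.

```python
# Sketch of the objective-evaluation branch: compare the key answer k_i
# extracted by BioFinder against the standard answer a_i.
# Assumptions: exact match for MCQ/text/numeric answers, per-position
# identity for sequence answers (illustrative, not the paper's exact M_obj).

def evaluate_objective(extracted: str, reference: str, task_type: str) -> float:
    if task_type in {"mcq", "text", "numeric"}:
        return float(extracted.strip().lower() == reference.strip().lower())
    if task_type == "sequence":
        if not reference:
            return 0.0
        matches = sum(a == b for a, b in zip(extracted, reference))
        return matches / len(reference)
    raise ValueError(f"unknown task type: {task_type}")

print(evaluate_objective("B", "b", "mcq"))                     # 1.0
print(evaluate_objective("CCHHHHCC", "CCHHHHCE", "sequence"))  # 0.875
```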

3.2 Fine-grained textual evaluation

To evaluate the quality of text generation, we employ metrics covering similarity, expertise, and logical consistency. Specifically, inspired by FACTSCORE and OLAPH (Min et al., 2023; Jeong et al., 2024), we compare the LLM’s response $\hat{P}$ with reference statements categorized as Must-Have (MH) and Nice-to-Have (NH), assigned weights $w_{\text{MH}}$ and $w_{\text{NH}}$, respectively, with $w_{\text{MH}} > w_{\text{NH}}$. Using the BioFinder framework and Natural Language Inference (NLI), we classify the relationship between the LLM’s response and each reference statement as Entailed, Contradicted, or Neutral. Based on this classification, we define the following metrics:

Comprehensiveness measures the extent to which the model’s response correctly includes the reference statements, defined as:

$$\text{Comprehensiveness}(\hat{P}) = \frac{\sum_{x\in\text{Entailed}} w(x)}{\sum_{x\in S} w(x)}, \tag{5}$$

where $S=\text{MH}\cup\text{NH}$ is the set of all reference statements, Entailed is the subset of statements classified as entailed by BioFinder, and $w(x)$ is the weight of statement $x$.

Hallucination Rate evaluates the proportion of contradictory information in the LLM’s response relative to the reference statements, reflecting the degree of misinformation, defined as:

$$\text{Hallucination rate}(\hat{P}) = \frac{\sum_{x\in\text{Contradicted}} w(x)}{\sum_{x\in S} w(x)}, \tag{6}$$

where Contradicted is the subset of statements classified as contradicted by BioFinder.

Omission rate measures the extent to which the model’s response omits reference statements (especially must-have information), defined as:

$$\text{Omission rate}(\hat{P}) = \frac{\sum_{x\in\text{Neutral}} w(x)}{\sum_{x\in S} w(x)}, \tag{7}$$

where Neutral is the subset of statements classified as neutral by BioFinder.

To assess the stability of the LLM’s performance across the dataset, we calculate the variability of the metrics and define Consistency as:

$$\text{Consistency} = 1 - \frac{\sigma_{\text{Comp}} + \sigma_{\text{Hall}} + \sigma_{\text{Omit}}}{\sum \sigma_{\text{max}}}, \tag{8}$$

where $\sigma_{\text{Comp}}$, $\sigma_{\text{Hall}}$, and $\sigma_{\text{Omit}}$ are the standard deviations of Comprehensiveness, Hallucination rate, and Omission rate, respectively, and $\sum\sigma_{\text{max}}$ is the sum of the maximum possible standard deviations.

We combine the above metrics to define the model’s overall performance score:

$$\text{Overall Score} = \alpha\,\overline{\text{Comp}} - \beta\,\overline{\text{Hall}} - \gamma\,\overline{\text{Omit}} + \delta\,\text{Consistency}, \tag{9}$$

where $\overline{\text{Comp}}$, $\overline{\text{Hall}}$, and $\overline{\text{Omit}}$ are the average values of Comprehensiveness, Hallucination rate, and Omission rate, respectively, and $\alpha$, $\beta$, $\gamma$, $\delta$ are weighting coefficients satisfying $\alpha+\beta+\gamma+\delta=1$.
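These metrics follow directly from BioFinder’s NLI labels. The sketch below assumes each reference statement carries a label in {entailment, contradiction, neutral} and a weight (higher for Must-Have than Nice-to-Have); the weight values, the weighting coefficients, and the value of $\sum\sigma_{\text{max}}$ are illustrative assumptions.

```python
import statistics

# Sketch of Eqs. (5)-(9): each reference statement has an NLI label assigned
# by BioFinder and a weight (Must-Have > Nice-to-Have). Weights here are
# illustrative assumptions, not the paper's values.
W_MH, W_NH = 1.0, 0.5

def fine_grained_metrics(statements):
    """statements: list of (label, is_must_have) with label in
    {'entailment', 'contradiction', 'neutral'}."""
    total = sum(W_MH if mh else W_NH for _, mh in statements)
    def frac(target):
        return sum((W_MH if mh else W_NH)
                   for lab, mh in statements if lab == target) / total
    return {
        "comprehensiveness": frac("entailment"),   # Eq. (5)
        "hallucination": frac("contradiction"),    # Eq. (6)
        "omission": frac("neutral"),               # Eq. (7)
    }

def overall_score(per_example, alpha=0.4, beta=0.3, gamma=0.2, delta=0.1):
    """per_example: list of metric dicts; alpha+beta+gamma+delta = 1
    (the coefficient values are assumptions)."""
    comp = [m["comprehensiveness"] for m in per_example]
    hall = [m["hallucination"] for m in per_example]
    omit = [m["omission"] for m in per_example]
    sigma_max = 3 * 0.5  # assumed max possible std for metrics bounded in [0, 1]
    consistency = 1 - (statistics.pstdev(comp) + statistics.pstdev(hall)
                       + statistics.pstdev(omit)) / sigma_max       # Eq. (8)
    return (alpha * statistics.mean(comp) - beta * statistics.mean(hall)
            - gamma * statistics.mean(omit) + delta * consistency)  # Eq. (9)
```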

[Figure 3]

4 Experiment and results

4.1 BioFinder

4.1.1 Experiment Setting

As outlined in Section 3.1, evaluating LLMs presents unique challenges. Although the xFinder framework demonstrated strong performance on diverse tasks like multiple-choice questions, short text matching, and mathematical extraction, it fell short on the Bio-benchmark dataset (Yu et al., 2024). To bridge this gap, we adapt xFinder-llama38it (https://huggingface.co/IAAR-Shanghai/xFinder-llama38it) into BioFinder, enhancing its capability for biological sequence extraction and long-text Natural Language Inference (NLI) tasks without reference answers.

For biological sequence extraction, we manually annotate 1,428 samples from six models in Bio-benchmark, focusing on 0-shot and five-shot outputs, and iteratively retrain on misextracted sequences using GPT-4o and RegEx (Contributors, 2023b). For the NLI task, we collect 13,698 samples, apply multi-round reasoning with GPT-4o, and refine the outputs through self-consistency checks and manual reviews (Wang et al., 2022).

Training uses advanced fine-tuning methodologies such as LoRA, PRoLoRA, QLoRA, and MoS (Hu et al., 2021; Wang et al., 2024c,d,b; Dettmers et al., 2024), specifically employing the XTuner framework with QLoRA. Prompts are strategically designed for each task to enhance the model’s adherence to instructions.
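For orientation, the snippet below sketches the underlying QLoRA idea (4-bit quantized base model plus LoRA adapters) with Hugging Face transformers and peft; the paper itself trains BioFinder with XTuner, and all hyperparameters shown here are illustrative assumptions rather than the actual training configuration.

```python
# Minimal QLoRA setup sketched with transformers + peft. BioFinder is trained
# with the XTuner framework; this only illustrates the QLoRA mechanism.
# All hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "IAAR-Shanghai/xFinder-llama38it"  # starting checkpoint named in the paper

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_cfg, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Supervised fine-tuning on the annotated extraction / NLI samples would follow,
# e.g. with XTuner's training loop or a generic SFT trainer.
```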

4.1.2 Baselines and Evaluation

Objective Answer Extraction

For objective tasks such as multiple-choice questions, short text matching, and mathematical answer extraction, we fine-tune xFinder to enhance its bio-sequence extraction capabilities while retaining all of its original functionality. Table 2 compares the performance of BioFinder, regular expressions, and GPT-4. BioFinder achieves 93.5% extraction accuracy on bio-sequence extraction tasks, exceeding regular expressions by over 30% and delivering more than twice the extraction quality of GPT-4 with the same prompt. Across all objective tasks, BioFinder’s average extraction accuracy reaches 94.7%, comparable to a single round of human review, outperforming regular expressions at 72.1% and GPT-4 at 62.9%.

Subjective Task Evaluation

We evaluate subjective tasks by comparing BioFinder with GPT-4 (https://openai.com/index/gpt-4/), RoBERTa-large-mnli (https://huggingface.co/FacebookAI/roberta-large-mnli), and GTE-large-en-v1.5 (https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5). The comparison involves mapping model outputs and reference answers into a shared embedding space and assessing cosine similarity. BioFinder outperforms the others on a medical NLI test, achieving 89.8% accuracy compared to 59.9% for GPT-4, 27.0% for RoBERTa-large-mnli, and 23.4% for GTE-large-en-v1.5, demonstrating state-of-the-art performance.

RoBERTa-large-mnli, which is not designed for long sequences (>512 tokens), requires text segmentation and a bagging approach to mitigate semantic interruptions. In addition, embedding models are found to be ineffective for NLI tasks, often misclassifying antonymous sentences as similar, contrary to their expected 'Contradiction' classification in NLI, as detailed in Appendix B.
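For reference, the RoBERTa-large-mnli baseline can be queried as a standard sequence-pair classifier; the character-based chunking and majority vote below are a simplified sketch of the bagging procedure described above, not the exact implementation used in our comparison.

```python
# Sketch of the RoBERTa-large-mnli baseline: classify the relation between a
# model response (premise) and a reference statement (hypothesis).
# The chunked "bagging" over long premises is a simplified assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "FacebookAI/roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def nli_label(premise: str, hypothesis: str) -> str:
    inputs = tokenizer(premise, hypothesis, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # id2label maps to CONTRADICTION / NEUTRAL / ENTAILMENT
    return model.config.id2label[int(logits.argmax(dim=-1))]

def nli_label_long(premise: str, hypothesis: str, chunk_chars: int = 1500) -> str:
    # Split an over-length premise into chunks and take a simple majority vote.
    chunks = [premise[i:i + chunk_chars]
              for i in range(0, len(premise), chunk_chars)] or [premise]
    votes = [nli_label(c, hypothesis) for c in chunks]
    return max(set(votes), key=votes.count)

print(nli_label("Aspirin inhibits platelet aggregation.",
                "Aspirin promotes blood clotting."))
```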

4.2 Bio-benchmark

4.2.1 Experiment Setting

We conduct experiments on the Bio-benchmark datasets. We use the OpenAI API and 8×A100-80G GPUs to run LLM inference with the LMDeploy and Hugging Face backends (Contributors, 2023a; hug, 2024). We employ sampling hyperparameters with top-p set to 1.0 and a temperature of 0.2 for generation (specific prompts are given in Appendix D). We use BioFinder to extract the answers for objective evaluation and to execute the NLI tasks.
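A single benchmark query could be issued through the OpenAI API with the sampling settings above as sketched below; the prompt text is a placeholder (actual prompts are in Appendix D), and open-weight models would be served analogously behind an OpenAI-compatible endpoint.

```python
# Sketch of one benchmark inference call with the sampling settings used in
# the paper (temperature 0.2, top-p 1.0). The prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # open-weight models can be served behind a compatible endpoint

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert bioinformatics assistant."},
        {"role": "user", "content": "Predict the secondary structure of the protein sequence: MKT..."},
    ],
    temperature=0.2,
    top_p=1.0,
)
print(response.choices[0].message.content)
```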

4.2.2 Evaluation metrics

To evaluate the performance of Bio-benchmark tasks on LLMs, we use the following metrics: (1) Bit score (Ye et al., 2006; Nawrocki et al., 2009; Nawrocki and Eddy, 2013): measures the alignment quality of sequences, normalizing the raw alignment score by database size and randomness. (2) Recovery rate: indicates the proportion of relevant instances accurately retrieved from the total relevant instances available. (3) Accuracy (Acc.): evaluates the proportion of correct predictions out of the total predictions. (4) BERTScore (Zhang et al., 2019): compares the similarity between embeddings of a generated sentence and the reference sentence. (5) Rouge-L: evaluates text similarity based on the longest common subsequence, focusing on sequence order and content overlap in summaries or translations. (6) GPT evaluation metrics (fluency, relevance, completeness, medicine proficiency).
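Two of these metrics are straightforward to reproduce. The sketch below assumes recovery rate is the fraction of reference positions reproduced by the prediction (an illustrative definition) and uses the rouge-score package for Rouge-L.

```python
# Sketch of two evaluation metrics. The per-position definition of recovery
# rate is an assumption for illustration; Rouge-L uses the rouge-score package.
from rouge_score import rouge_scorer

def recovery_rate(predicted: str, reference: str) -> float:
    """Fraction of reference positions reproduced by the prediction (assumed definition)."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(
    "the patient was diagnosed with community acquired pneumonia",
    "patient diagnosed with pneumonia acquired in the community",
)["rougeL"].fmeasure

print(recovery_rate("CCHHHHCC", "CCHHHECC"), round(rouge_l, 3))
```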

Table 2: Extraction accuracy on objective tasks and NLI accuracy (%).

| Task | RegEx | GPT-4o | BioFinder | RoBERTa-large-mnli | GTE-large-en-v1.5 |
|---|---|---|---|---|---|
| MCQ Matching | 77.5 | 65.8 | 95.5 | - | - |
| Text Matching | 74.8 | 80.5 | 94.3 | - | - |
| Numerical Matching | 68.1 | 67.0 | 95.5 | - | - |
| Sequence Extraction | 68.0 | 38.5 | 93.5 | - | - |
| Overall Objective Tasks | 72.1 | 62.9 | 94.7 | - | - |
| NLI Tasks | - | 59.9 | 89.8 | 27.0 | 23.4 |

4.3 Task-specific evaluation metrics

The metric used for each Bio-benchmark task is listed below; for more detailed experimental evaluation metrics, please see the Appendix.

1. Protein: Pfam design (bit score), protein species prediction (Acc.), protein inverse folding (recovery rate), protein structure prediction (recovery rate).
2. RNA: RNA function prediction (Acc.), RNA inverse folding (recovery rate), RNA structure prediction (F1), Rfam design (bit score), sgRNA efficiency prediction (Acc.).
3. RBP: RNA-binding protein (Acc.).
4. Drug: Drug-drug interaction (Acc.), drug-target interaction (Acc.), drug design (Acc.).
5. EHR: AgentClinic (Acc.), CMB-Clinic (fluency, relevance, completeness, medical proficiency), IMCS-MRG (Rouge-L).
6. Medical: HeadQA (Acc.), MedLFQA-HealthQA (comprehensiveness), MedLFQA-KQA (comprehensiveness), MedLFQA-LiveQA (comprehensiveness), MedLFQA-MedicationQA (comprehensiveness), MedMCQA (Acc.), MedQA-CN (Acc.), MedQA-TW (Acc.), MedQA-US (Acc.), MMCU (Acc.).
7. TCM: CMB-Exam (Acc.), CMMLU-TCM (Acc.), MedLFQA-TCM (Acc.), TCMSD (Acc.).

5 Analysis

We observe that prompt formats significantly affect LLM performance in biological sequence inference. Specifically, continuous bio-sequences versus those separated by spaces or newline characters lead to different tokenizations because LLMs use Byte Pair Encoding (BPE). BPE tokenizers often treat several consecutive uppercase letters as a single token, which impairs the ability of LLMs to understand and generate bio-sequences. Figure 4 shows that separating sequences with newline characters achieves over three times the alignment accuracy compared to continuous inputs.
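The effect can be inspected directly with a BPE tokenizer. The sketch below uses tiktoken's cl100k_base vocabulary as a stand-in (the benchmarked models use their own tokenizers, so exact splits differ) to show that a continuous amino-acid string collapses into fewer, multi-letter tokens than a character-separated one.

```python
# Inspecting how a BPE tokenizer splits a continuous vs. a separated
# bio-sequence. tiktoken's cl100k_base is used as a stand-in vocabulary;
# the exact splits differ for each model's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
fragment = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative protein fragment

for label, text in [("continuous", fragment),
                    ("newline-separated", "\n".join(fragment))]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{label}: {len(tokens)} tokens -> {pieces[:8]} ...")
# The continuous form packs several residues into single multi-letter tokens,
# while the separated form keeps roughly one residue per token.
```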

In addition, from Figure 2, Figure 3, and Table 2, we can draw the following observations.

5.1 Protein benchmark

In protein species prediction, few-shot prompting significantly improves accuracy across all LLMs. Yi-1.5-34b shows at least a sixfold increase, while InternLM-2.5-20b shows a nearly twentyfold increase. Mistral-large-2 leads with 82% accuracy, followed by Llama-3.1-70b at 79%.

For protein inverse folding and protein structure prediction, we measure the sequence recovery rate. The latter task shows notably better performance, with accuracy approximately four times higher than the former. Notably, in the five-shot setting, Llama-3.1-70b achieves the highest score of 34% in protein structure prediction.

In addition, we evaluate the proficiency of LLMs in generating protein sequences from specific Pfam IDs. All LLMs score zero in 0-shot tests, indicating their inability to generate accurate sequences based solely on Pfam IDs. However, with 10-shot prompting, LLMs improve substantially, achieving bit scores over 50%, with GPT-4o reaching the highest at 87%.

5.2 RNA benchmark

We evaluate large language models on RNA sequence processing tasks. For RNA function prediction, similar to protein species prediction, few-shot (5-shot) prompting significantly enhances model performance over 0-shot, with Llama-3.1-70b (5-shot) achieving the highest accuracy of 89%.

In the RNA inverse-folding and structure prediction tasks, results are generally disappointing. However, RNA inverse folding outperforms structure prediction, with Llama-3.1-70b (5-shot) leading at a recovery rate of 26%. Few-shot prompting notably improves outcomes in both RNA tasks.

We also examine sgRNA efficiency prediction, which is crucial for CRISPR gene-editing success. GPT-4o (5-shot) and InternLM-2.5-20b (0-shot) both score above 30%. Notably, InternLM-2.5-20b performs better in 0-shot than in 5-shot, where it scores zero, suggesting no benefit from extra examples.

Lastly, we evaluate RNA sequence generation from specific Rfam IDs, using bit scores to judge sequence quality. 10-shot prompting consistently outperforms 0-shot across all LLMs, though the average score of 40.78% is lower than the 72.63% achieved in the protein section under the same conditions.

5.3 RBP Benchmark

The RNA-binding protein task evaluates the ability of LLMs to predict whether a given RNA sequence binds to a specific protein. The LLMs achieve an average accuracy of 53%, with the highest scores being 71% for Mistral-large-2 (5-shot) and 61% for GPT-4o (5-shot). In this task, few-shot prompting slightly improves performance compared to 0-shot prompting.

5.4 Drug Benchmark

According to Figure 5, we conduct experiments on drug sequences similar to those performed on biological sequences. By counting the number of atoms in the drug sequences, we find that the prompt format significantly impacts the LLMs’ drug-sequence generation. In the drug design task, large language models are tested on their ability to predict the efficacy of molecular entities (SMILES strings) against specific bacteria. With 5-shot prompting, all LLMs perform well, achieving accuracy scores above 80%, with Mistral-large-2 (5-shot) topping out at 91%. Notably, 5-shot prompting boosts the performance of Llama-3.1-70b significantly, from 12% in 0-shot to 86%.
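Atom counts of this kind can be obtained with RDKit, as sketched below; the exact counting procedure used in our analysis is not specified here, and the SMILES string (aspirin) is purely an illustrative molecule.

```python
# Counting atoms in a SMILES string with RDKit, as one way to summarize
# generated drug sequences. Aspirin is used purely as an example molecule.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin (illustrative)
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise ValueError("invalid SMILES")

heavy_atoms = mol.GetNumAtoms()            # heavy atoms only, by default
all_atoms = Chem.AddHs(mol).GetNumAtoms()  # including hydrogens
print(f"heavy atoms: {heavy_atoms}, total atoms: {all_atoms}")
```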

In the drug-drug and drug-target interaction tasks, LLMs evaluate potential interactions. Drug-drug interaction predictions are unsatisfactory across all models, with the best score being 0.47 for GPT-4o in 0-shot, which decreases to 0.34 with 5-shot prompting. Conversely, drug-target interaction predictions show better results; InternLM-2.5-20b (5-shot) and GPT-4o (5-shot) achieve the highest scores of 73% and 70%, respectively. 5-shot prompting significantly enhances performance, increasing the average accuracy from 37% in 0-shot to 62%.

5.5 EHR Benchmark

In the EHR benchmark, LLMs show high accuracy across three medical reasoning subtasks: AgentClinic, CMB-Clinic, and IMCS-MRG. For the AgentClinic task, average metric values are 73.2% (0-shot) and 72.7% (5-shot), with GPT-4o achieving a high of 82.24% in 0-shot.

Performance on the CMB-Clinic task is excellent, with average metric values of 0.917 in 0-shot and 0.867 in five-shot settings using CoT reasoning. This performance closely matches the reference answers, highlighting the potential of LLMs in medical case diagnosis.

Note that for the CMB-Clinic task, the accuracy of GPT-4o decreases slightly in the five-shot setting compared with the 0-shot setting. This decrease is attributed to interference from the example questions and the inclusion of irrelevant information in few-shot prompting, where longer inputs do not improve reasoning outcomes.

5.6 Medical Benchmark

For the Medical-QA benchmark, which includes seven datasets, the average performance of LLMs in 0-shot and 5-shot settings is similar. However, the overall performance varies significantly across datasets. LLMs perform well on multiple-choice problems, including HeadQA, MedMCQA, MedQA-CN, MedQA-TW, and MedQA-US, with average 0-shot metric values of 80.6%, 68.7%, 81.3%, 76.7%, and 74.0%, respectively. Medical QA tasks are particularly well-suited to large language models, aligning effectively with their training domain.

Furthermore, there is little to no improvement with five-shot prompting, and in some cases performance decreases. A likely reason is that biomedical QA aligns well with LLMs, which already achieve high 0-shot performance, so the additional five-shot examples provide minimal benefit.

5.7 TCM Benchmark

Overall, LLMs perform well on the TCM benchmark. Of the four subtasks, only TCMSD shows relatively lower performance, with an average metric value of 31.7%. Notably, there is a significant improvement from 0-shot to five-shot settings: the average metric value improves from 31.7% to 65.3%. This is likely because TCM is less commonly covered during model training, so well-designed prompts can enhance performance. We infer that if LLMs are fine-tuned on related corpora, performance could improve further.

6 Conclusion

This study establishes a comprehensive benchmarking framework for evaluating the performance of various LLMs on bioinformatics tasks. Using the BioFinder tool, we accurately extract key answers from LLM outputs, significantly enhancing the accuracy of answer extraction. Our results demonstrate that LLMs perform well in multiple subdomains of bioinformatics, such as protein, RNA, and drug design tasks, particularly in few-shot settings. Future research may further optimize prompt engineering strategies to enhance the efficiency and precision of LLMs on specific tasks.

Limitations

Despite the comprehensive nature of the Bio-benchmark and the novel approach of the BioFinder tool, our study has several limitations that warrant consideration. Firstly, the benchmark primarily assesses the intrinsic capabilities of LLMs using 0-shot and few-shot settings without fine-tuning. While this approach provides insights into the raw abilities of these models, it may not fully reflect their performance in practical, real-world applications where fine-tuning on specific tasks can significantly enhance effectiveness.

Secondly, while we cover a broad range of biological tasks, the selection of these tasks and their corresponding datasets might still not be representative of all possible bioinformatics challenges. The diversity and complexity of biological data are such that even a comprehensive benchmark like ours cannot encapsulate all potential tasks and scenarios. This limitation could affect the generalizability of our findings and recommendations.

Furthermore, the performance of the BioFinder tool, although superior to existing methods by a significant margin, is heavily dependent on the quality and structure of the data fed into the LLMs. In scenarios where the LLM outputs are ambiguous or overly complex, the extraction accuracy of BioFinder might be compromised. This limitation underscores the ongoing challenge in the field of bioinformatics to develop methods that are robust across a wide variety of data types and quality.

Additionally, our benchmark does not address the computational and financial costs associated with deploying large language models. The use of such models, particularly in 0-shot and few-shot scenarios, can be resource-intensive. This aspect could limit the accessibility of our proposed methods for researchers with limited resources.

Lastly, while our study indicates promising areas for the application of LLMs in bioinformatics, it also reveals that certain tasks remain challenging for these models. Developing LLM architectures or training strategies that can tackle these hard-to-model aspects of biological data remains a key area for future research. The ongoing development of LLMs should focus on enhancing adaptability and accuracy in these less tractable areas to broaden the utility of LLMs in bioinformatics.

Ethics Statement

This paper does not involve an ethics statement.

References

  • hug (2024)2024.Hugging face inference api and endpoints.https://huggingface.co/docs/inference-endpoints/index.
  • 0.1AI (2024)0.1AI. 2024.01-ai/Yi-1.5-34B-Chat · Hugging Face — huggingface.co.https://huggingface.co/01-ai/Yi-1.5-34B-Chat.[Accessed 15-10-2024].
  • Ali etal. (2024)Murtaza Ali, Prerna Rao, Yifan Mai, and Benjamin Xie. 2024.Using benchmarking infrastructure to evaluate llm performance on cs concept inventories: Challenges, opportunities, and critiques.In Proceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, pages 452–468.
  • Asai etal. (2023)Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023.Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511.
  • Bae etal. (2024)Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, etal. 2024.Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images.Advances in Neural Information Processing Systems, 36.
  • Bolton etal. (2024)Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, etal. 2024.Biomedlm: A 2.7 b parameter language model trained on biomedical text.arXiv preprint arXiv:2403.18421.
  • Brixi etal. (2025)Garyk Brixi, MatthewG Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, GabrielA Gonzalez, SamuelH King, DavidB Li, AditiT Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, DavidW Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, NicholasK Wang, Etowah Adams, StephenA Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, AmyX Lu, Reshma Mehta, MohammadR.K. Mofrad, MadelenaY Ng, Jaspreet Pannu, Christopher Re, JonathanC Schmok, John St.John, Jeremy Sullivan, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, AnthonyB. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Tom McGrath, Kimberly Powell, DaveP. Burke, Hani Goodarzi, PatrickD Hsu, and Brian Hie. 2025.Genome modeling and design across all domains of life with evo 2.bioRxiv.
  • Brown (2020)TomB Brown. 2020.Language models are few-shot learners.arXiv preprint arXiv:2005.14165.
  • Bubeck etal. (2023)Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, YinTat Lee, Yuanzhi Li, Scott Lundberg, etal. 2023.Sparks of artificial general intelligence: Early experiments with gpt-4.arXiv preprint arXiv:2303.12712.
  • Cai etal. (2024)Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, QiFan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, LiMa, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, YuSun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, LiZhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang,Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, YuQiao, and Dahua Lin. 2024.Internlm2 technical report.Preprint, arXiv:2403.17297.
  • Chen etal. (2023)Qijie Chen, Haotong Sun, Haoyang Liu, Yinghui Jiang, Ting Ran, Xurui Jin, Xianglu Xiao, Zhimin Lin, Hongming Chen, and Zhangmin Niu. 2023.An extensive benchmark study on biomedical text generation and mining with chatgpt.Bioinformatics, 39(9):btad557.
  • Chen etal. (2024)Shan Chen, Yingya Li, Sheng Lu, Hoang Van, HugoJWL Aerts, GuerganaK Savova, and DanielleS Bitterman. 2024.Evaluating the chatgpt family of models for biomedical reasoning and classification.Journal of the American Medical Informatics Association, 31(4):940–948.
  • Contributors (2023a)LMDeploy Contributors. 2023a.Lmdeploy: A toolkit for compressing, deploying, and serving llm.https://github.com/InternLM/lmdeploy.
  • Contributors (2023b)OpenCompass Contributors. 2023b.Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass.
  • Contributors (2023c)OpenCompass Contributors. 2023c.Opencompass: A universal evaluation platform for foundation models.GitHub repository.
  • Dettmers etal. (2024)Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024.Qlora: Efficient finetuning of quantized llms.Advances in Neural Information Processing Systems, 36.
  • Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, etal. 2024.The llama 3 herd of models.arXiv preprint arXiv:2407.21783.
  • Dubois etal. (2024)Yann Dubois, Balázs Galambosi, Percy Liang, and TatsunoriB Hashimoto. 2024.Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475.
  • Edwards etal. (2023)Carl Edwards, Aakanksha Naik, Tushar Khot, Martin Burke, Heng Ji, and Tom Hope. 2023.Synergpt: In-context learning for personalized drug synergy prediction and drug design.arXiv preprint arXiv:2307.11694.
  • Gao etal. (2024)Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, TimothyA Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar. 2024.When raw data prevails: Are large language model embeddings effective in numerical data representation for medical machine learning applications?arXiv preprint arXiv:2408.11854.
  • Gu etal. (2022)Jiasheng Gu, Hongyu Zhao, Hanzi Xu, Liangyu Nie, Hongyuan Mei, and Wenpeng Yin. 2022.Robustness of learning from task instructions.arXiv preprint arXiv:2212.03813.
  • Guo etal. (2025)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, etal. 2025.Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948.
  • Hamed etal. (2024)AhmedAbdeen Hamed, TamerE Fandy, and Xindong Wu. 2024.Accelerating complex disease treatment through network medicine and genai: A case study on drug repurposing for breast cancer.arXiv preprint arXiv:2406.13106.
  • He etal. (2024a)Chaoqun He, Renjie Luo, Shengding Hu, Yuanqian Zhao, Jie Zhou, Hanghao Wu, Jiajie Zhang, XuHan, Zhiyuan Liu, and Maosong Sun. 2024a.Ultraeval: A lightweight platform for flexible and comprehensive evaluation for llms.arXiv preprint arXiv:2404.07584.
  • He etal. (2024b)Yunzhen He, Hiroaki Yamagiwa, and Hidetoshi Shimodaira. 2024b.Shimo lab at" discharge me!": Discharge summarization by prompt-driven concatenation of electronic health record sections.arXiv preprint arXiv:2406.18094.
  • Hou and Ji (2024)Wenpin Hou and Zhicheng Ji. 2024.Assessing gpt-4 for cell type annotation in single-cell rna-seq analysis.Nature Methods, pages 1–4.
  • Hu etal. (2021)EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen. 2021.Lora: Low-rank adaptation of large language models.Preprint, arXiv:2106.09685.
  • Huang etal. (2024)Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024.An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers.arXiv preprint arXiv:2403.02839.
  • Hui etal. (2024)Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, etal. 2024.Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186.
  • Jeong etal. (2024)Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, and Jaewoo Kang. 2024.Olaph: Improving factuality in biomedical long-form question answering.arXiv preprint arXiv:2405.12701.
  • Ji etal. (2021)Yanrong Ji, Zhihan Zhou, Han Liu, and RamanaV Davuluri. 2021.Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome.Bioinformatics, 37(15):2112–2120.
  • Jiang etal. (2024)Jiyue Jiang, Pengan Chen, Liheng Chen, Sheng Wang, Qinghang Bao, Lingpeng Kong, YuLi, and Chuan Wu. 2024.How well do llms handle cantonese? benchmarking cantonese capabilities of large language models.arXiv preprint arXiv:2408.16756.
  • Jiang etal. (2023)Jiyue Jiang, Sheng Wang, Qintong Li, Lingpeng Kong, and Chuan Wu. 2023.A cognitive stimulation dialogue system with multi-source knowledge fusion for elders with cognitive impairment.arXiv preprint arXiv:2305.08200.
  • Jin etal. (2021)DiJin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021.What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421.
  • Jin etal. (2019)Qiao Jin, Bhuwan Dhingra, Zhengping Liu, WilliamW Cohen, and Xinghua Lu. 2019.Pubmedqa: A dataset for biomedical research question answering.arXiv preprint arXiv:1909.06146.
  • Johnson etal. (2023)AlistairEW Johnson, Lucas Bulgarelli, LuShen, Alvin Gayles, Ayad Shammout, Steven Horng, TomJ Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, etal. 2023.Mimic-iv, a freely accessible electronic health record dataset.Scientific data, 10(1):1.
  • Jumper etal. (2021)John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, etal. 2021.Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589.
  • Kong etal. (2024)Deqian Kong, Yuhao Huang, Jianwen Xie, Edouardo Honig, Ming Xu, Shuanghong Xue, Pei Lin, Sanping Zhou, Sheng Zhong, Nanning Zheng, etal. 2024.Dual-space optimization: Improved molecule sequence design by latent prompt transformer.arXiv preprint arXiv:2402.17179.
  • Li etal. (2023a)Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023a.Huatuo-26m, a large-scale chinese medical qa dataset.Preprint, arXiv:2305.01526.
  • Li and Liang (2021)XiangLisa Li and Percy Liang. 2021.Prefix-tuning: Optimizing continuous prompts for generation.arXiv preprint arXiv:2101.00190.
  • Li etal. (2023b)Zhongshen Li, Junru Jin, Wentao Long, and Leyi Wei. 2023b.Plpmpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model.Computers in Biology and Medicine, 164:107260.
  • Lisanza etal. (2024)SidneyLyayuga Lisanza, JacobMerle Gershon, SamuelWK Tipps, JeremiahNelson Sims, Lucas Arnoldt, SamuelJ Hendel, MiriamK Simma, GeLiu, Muna Yase, Hongwei Wu, etal. 2024.Multistate and functional protein design using rosettafold sequence space diffusion.Nature biotechnology, pages 1–11.
  • Liu etal. (2021)Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021.Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021.URL https://arxiv. org/abs/2107.13586.
  • Liu etal. (2023)Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023.Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM Computing Surveys, 55(9):1–35.
  • Lv etal. (2024)Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, LiYuan, and Yonghong Tian. 2024.Prollama: A protein large language model for multi-task protein language processing.arXiv preprint arXiv:2402.16445.
  • McIntosh etal. (2024)TimothyR McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, and MalkaN Halgamuge. 2024.Inadequacies of large language model benchmarks in the era of generative artificial intelligence.arXiv preprint arXiv:2402.09880.
  • Min etal. (2023)Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, PangWei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023.Factscore: Fine-grained atomic evaluation of factual precision in long form text generation.arXiv preprint arXiv:2305.14251.
  • Mistral-AI (2024)Mistral-AI. 2024.migtissera/Tess-3-Mistral-Large-2-123B · Hugging Face — huggingface.co.https://huggingface.co/migtissera/Tess-3-Mistral-Large-2-123B.[Accessed 15-10-2024].
  • Nawrocki and Eddy (2013)EricP Nawrocki and SeanR Eddy. 2013.Infernal 1.1: 100-fold faster rna homology searches.Bioinformatics, 29(22):2933–2935.
  • Nawrocki etal. (2009)EricP Nawrocki, DianaL Kolbe, and SeanR Eddy. 2009.Infernal 1.0: inference of rna alignments.Bioinformatics, 25(10):1335–1337.
  • Nguyen etal. (2024)Eric Nguyen, Michael Poli, MatthewG. Durrant, Brian Kang, Dhruva Katrekar, DavidB. Li, LiamJ. Bartie, ArminW. Thomas, SamuelH. King, Garyk Brixi, Jeremy Sullivan, MadelenaY. Ng, Ashley Lewis, Aaron Lou, Stefano Ermon, StephenA. Baccus, Tina Hernandez-Boussard, Christopher Ré, PatrickD. Hsu, and BrianL. Hie. 2024.Sequence modeling and design from molecular to genome scale with evo.Science, 386(6723):eado9336.
  • Notin etal. (2023)Pascal Notin, AaronW Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Hansen Spinner, Nathan Rollins, Ada Shaw, Ruben Weitzman, Jonathan Frazer, etal. 2023.Proteingym: Large-scale benchmarks for protein design and fitness prediction.bioRxiv.
  • OpenAI (2024) OpenAI. 2024. GPT-4o. https://openai.com/index/hello-gpt-4o/. [Accessed 15-10-2024].
  • Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
  • Ren et al. (2024) Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, et al. 2024. BEACON: Benchmark for comprehensive RNA tasks and language models. arXiv preprint arXiv:2406.10391.
  • Robinson et al. (2022) Joshua Robinson, Christopher Michael Rytting, and David Wingate. 2022. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353.
  • Röttger et al. (2024) Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, and Dirk Hovy. 2024. Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models. arXiv preprint arXiv:2402.16786.
  • Runge et al. (2024) Frederic Runge, Karim Farid, Jorg KH Franke, and Frank Hutter. 2024. RnaBench: A comprehensive library for in silico RNA modelling. bioRxiv, pages 2024–01.
  • Singh (2024) Arunima Singh. 2024. LLMs predict protein phases. Nature Methods, pages 1–1.
  • Singhal et al. (2023) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
  • Soni et al. (2023) Sarvesh Soni, Surabhi Datta, and Kirk Roberts. 2023. quEHRy: a question answering system to query electronic health records. Journal of the American Medical Informatics Association, 30(6):1091–1102.
  • Tan et al. (2023) Yang Tan, Mingchen Li, Zijie Huang, Huiqun Yu, and Guisheng Fan. 2023. MedChatZH: a better medical adviser learns from better instructions. arXiv preprint arXiv:2309.01114.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Wang et al. (2024a) Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, and Haoyi Xiong. 2024a. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nature Machine Intelligence, pages 1–10.
  • Wang et al. (2024b) Sheng Wang, Liheng Chen, Pengan Chen, Jingwei Dong, Boyang Xue, Jiyue Jiang, Lingpeng Kong, and Chuan Wu. 2024b. MoS: Unleashing parameter efficiency of low-rank adaptation with mixture of shards. arXiv preprint arXiv:2410.00938.
  • Wang et al. (2024c) Sheng Wang, Liheng Chen, Jiyue Jiang, Boyang Xue, Lingpeng Kong, and Chuan Wu. 2024c. LoRA meets dropout under a unified framework. arXiv preprint arXiv:2403.00812.
  • Wang et al. (2024d) Sheng Wang, Boyang Xue, Jiacheng Ye, Jiyue Jiang, Liheng Chen, Lingpeng Kong, and Chuan Wu. 2024d. PRoLoRA: Partial rotation empowers more parameter-efficient LoRA. arXiv preprint arXiv:2402.16902.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wang et al. (2023a) Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023a. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087.
  • Wang et al. (2023b) Zeyuan Wang, Qiang Zhang, Keyan Ding, Ming Qin, Xiang Zhuang, Xiaotong Li, and Huajun Chen. 2023b. InstructProtein: Aligning human and protein language via knowledge instruction. arXiv preprint arXiv:2310.03269.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Wu et al. (2024a) Peng Wu, Huabin Du, Yingchao Yan, Tzong-Yi Lee, Chen Bai, and Song Wu. 2024a. Guided diffusion for molecular generation with interaction prompt. Briefings in Bioinformatics, 25(3):bbae174.
  • Wu et al. (2024b) Xidong Wu, Yiming Zeng, Arun Das, Sumin Jo, Tinghe Zhang, Parth Patel, Jianqiu Zhang, Shou-Jiang Gao, Dexter Pratt, Yu-Chiao Chiu, et al. 2024b. ReguloGPT: Harnessing GPT for knowledge graph construction of molecular regulatory pathways. bioRxiv.
  • Xiao et al. (2024) Xi Xiao, Wentao Wang, Jiacheng Xie, Lijing Zhu, Gaofei Chen, Zhengji Li, Tianyang Wang, and Min Xu. 2024. HGTDP-DTA: Hybrid graph-transformer with dynamic prompt for drug-target binding affinity prediction. arXiv preprint arXiv:2406.17697.
  • Xu et al. (2022) Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang, Chang Ma, Runcheng Liu, and Jian Tang. 2022. PEER: a comprehensive and multi-task benchmark for protein sequence understanding. Advances in Neural Information Processing Systems, 35:35156–35173.
  • Ye et al. (2024) Fei Ye, Zaixiang Zheng, Dongyu Xue, Yuning Shen, Lihao Wang, Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou, and Quanquan Gu. 2024. ProteinBench: A holistic evaluation of protein foundation models. arXiv preprint arXiv:2409.06744.
  • Ye et al. (2006) Jian Ye, Scott McGinnis, and Thomas L Madden. 2006. BLAST: improvements for better sequence analysis. Nucleic Acids Research, 34(suppl_2):W6–W9.
  • Yizhen et al. (2024) Li Yizhen, Huang Shaohan, Qi Jiaxing, Quan Lei, Han Dongran, and Luan Zhongzhi. 2024. Exploring the comprehension of ChatGPT in traditional Chinese medicine knowledge. arXiv preprint arXiv:2403.09164.
  • Yu et al. (2024) Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, and Ding Chen. 2024. xFinder: Robust and pinpoint answer extraction for large language models. arXiv preprint arXiv:2405.11874.
  • Yukun et al. (2024) Zhao Yukun, Yan Lingyong, Sun Weiwei, Xing Guoliang, Wang Shuaiqiang, Meng Chong, Cheng Zhicong, Ren Zhaochun, and Yin Dawei. 2024. Improving the robustness of large language models via consistency alignment. arXiv preprint arXiv:2403.14221.
  • Zhang et al. (2024a) Ke Zhang, Yimiao Feng, and Jie Zheng. 2024a. Prompt-based generation of natural language explanations of synthetic lethality for cancer drug discovery. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 13131–13142.
  • Zhang et al. (2024b) Qiang Zhang, Keyang Ding, Tianwen Lyv, Xinda Wang, Qingyu Yin, Yiwen Zhang, Jing Yu, Yuhao Wang, Xiaotong Li, Zhuoyi Xiang, et al. 2024b. Scientific large language models: A survey on biological & chemical domains. arXiv preprint arXiv:2401.14656.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.
  • Zhu et al. (2023) Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
  • Zou et al. (2023) Shuxian Zou, Shentong Mo, Hui Li, Xingyi Cheng, Le Song, and Eric Xing. 2023. Linker-tuning: Optimizing continuous prompts for heterodimeric protein prediction.

Appendix A Appendix

A.1 Results

Performance of six LLMs on the Bio-benchmark subtasks. Each model column reports the 0-shot / 5-shot scores; Count is the number of evaluated instances.

| Task | Subtask | Count | GPT-4o | InternLM-2.5 20b | Llama-3.1 70b | Mistral-large-2 | Qwen 2.5-72b | Yi-1.5 34b |
|---|---|---|---|---|---|---|---|---|
| Protein | Protein-species-prediction | 200 | 9.00 / 76.50 | 4.00 / 78.50 | 9.50 / 79.00 | 8.50 / 82.00 | 10.00 / 76.00 | 12.00 / 75.00 |
| Protein | Protein-inverse-folding | 264 | 6.97 / 6.29 | 1.46 / 4.64 | 6.69 / 6.88 | 5.65 / 6.70 | 7.02 / 6.95 | 4.48 / 5.77 |
| Protein | Protein-structure-prediction | 264 | 25.09 / 29.79 | 3.96 / 28.21 | 24.63 / 34.31 | 21.02 / 24.23 | 18.11 / 27.13 | 5.99 / 23.99 |
| RBP | RNA-binding-protein | 70 | 57.14 / 61.43 | 44.29 / 50.00 | 52.86 / 51.43 | 52.86 / 71.43 | 47.14 / 47.14 | 48.57 / 48.57 |
| RNA | RNA-function-prediction | 280 | 4.64 / 87.86 | 2.14 / 78.57 | 3.93 / 88.57 | 4.64 / 71.79 | 6.07 / 79.29 | 3.93 / 91.07 |
| RNA | RNA-inverse-folding | 200 | 19.76 / 21.71 | 20.64 / 27.19 | 19.50 / 26.25 | 20.62 / 22.90 | 20.28 / 21.92 | 12.99 / 21.82 |
| RNA | RNA-structure-prediction | 200 | 0.50 / 2.14 | 0.00 / 2.26 | 0.07 / 0.75 | 0.65 / 0.10 | 0.43 / 0.07 | 0.09 / 0.37 |
| RNA | sgRNA-efficiency-prediction | 300 | 0.67 / 39.33 | 36.67 / 0.00 | 1.33 / 0.67 | 7.67 / 0.33 | 0.33 / 20.67 | 0.00 / 13.33 |
| Drug | Drug-Drug-interaction | 86 | 46.51 / 33.72 | 12.79 / 10.47 | 36.05 / 36.05 | 25.58 / 36.05 | 34.88 / 31.40 | 22.09 / 29.07 |
| Drug | Drug-Target-interaction | 60 | 43.33 / 70.00 | 40.00 / 73.33 | 51.67 / 61.67 | 10.00 / 50.00 | 43.33 / 58.33 | 33.33 / 58.33 |
| Drug | Drug-design | 58 | 70.69 / 84.48 | 81.03 / 84.48 | 12.07 / 86.21 | 48.28 / 91.38 | 44.83 / 87.93 | 58.62 / 86.21 |
| EHR | Agentclinic | 214 | 82.24 / 62.15 | 63.55 / 69.16 | 79.44 / 80.84 | 78.97 / 78.97 | 73.83 / 77.57 | 61.21 / 67.29 |
| EHR | CMB-Clinic | 74 | 94.93 / 93.92 | 80.95 / 84.66 | 88.90 / 85.14 | 95.54 / 77.08 | 98.56 / 97.97 | 91.55 / 81.35 |
| EHR | IMCS-MRG | 200 | 63.02 / 69.02 | 62.82 / 0.00 | 67.21 / 70.21 | 61.56 / 68.64 | 62.98 / 68.48 | 69.20 / 69.40 |
| Medical | HeadQA | 120 | 90.00 / 90.83 | 66.67 / 66.67 | 86.67 / 83.33 | 86.67 / 83.33 | 82.50 / 83.33 | 70.83 / 64.17 |
| Medical | MedLFQA-HealthQA | 50 | 43.76 / 43.03 | 44.27 / 45.03 | 26.75 / 31.84 | 39.19 / 45.24 | 37.37 / 43.20 | 40.81 / 42.67 |
| Medical | MedLFQA-KQA | 50 | 42.34 / 40.41 | 32.73 / 35.89 | 18.12 / 26.76 | 32.93 / 37.55 | 30.82 / 36.25 | 30.77 / 26.88 |
| Medical | MedLFQA-LiveQA | 50 | 29.57 / 29.60 | 26.05 / 28.69 | 17.76 / 19.95 | 24.80 / 28.67 | 25.07 / 28.25 | 22.82 / 19.85 |
| Medical | MedLFQA-MedicationQA | 50 | 32.70 / 36.94 | 33.54 / 36.31 | 21.45 / 24.25 | 28.99 / 33.84 | 35.82 / 36.26 | 28.51 / 27.08 |
| Medical | MedMCQA | 100 | 78.00 / 79.00 | 56.00 / 56.00 | 76.00 / 80.00 | 78.00 / 84.00 | 67.00 / 70.00 | 57.00 / 60.00 |
| Medical | MedQA-CN | 50 | 82.00 / 82.00 | 82.00 / 80.00 | 84.00 / 82.00 | 70.00 / 72.00 | 88.00 / 90.00 | 82.00 / 84.00 |
| Medical | MedQA-TW | 50 | 90.00 / 88.00 | 60.00 / 66.00 | 86.00 / 86.00 | 80.00 / 76.00 | 82.00 / 82.00 | 62.00 / 64.00 |
| Medical | MedQA-US | 50 | 92.00 / 80.00 | 52.00 / 54.00 | 86.00 / 84.00 | 84.00 / 80.00 | 84.00 / 68.00 | 46.00 / 48.00 |
| Medical | MMCU | 142 | 59.15 / 61.27 | 35.92 / 38.03 | 62.68 / 59.15 | 48.59 / 46.48 | 62.68 / 63.38 | 46.48 / 49.30 |
| TCM | CMB-Exam | 200 | 56.00 / 60.50 | 62.50 / 61.50 | 50.50 / 51.00 | 43.50 / 45.50 | 59.50 / 64.00 | 54.50 / 68.50 |
| TCM | CMMLU-TCM | 185 | 72.97 / 73.51 | 80.00 / 82.70 | 65.41 / 71.35 | 61.62 / 57.84 | 82.70 / 83.24 | 78.38 / 83.78 |
| TCM | MedLFQA-TCM | 200 | 72.00 / 73.50 | 82.00 / 85.00 | 63.00 / 64.50 | 59.00 / 61.50 | 82.50 / 90.00 | 75.50 / 87.50 |
| TCM | TCMSD | 200 | 39.00 / 68.75 | 30.00 / 60.75 | 16.00 / 69.25 | 32.25 / 67.00 | 38.25 / 67.25 | 34.50 / 59.00 |
[Figure 4]
[Figure 5]

A.2 Related Work

A.2.1 Bioinformatics Benchmark

With the emergence of LLMs for various biological tasks, a growing number of benchmarks have been designed for different input types, including RNA, proteins, and biomedicine-related texts. Proteins were among the first biological targets to which LLMs were applied, and multiple established benchmarks exist for protein fitness prediction (Notin et al., 2023), protein design (Ye et al., 2024), and multi-task evaluation (Xu et al., 2022). For the more recently studied RNA LLMs, benchmarks have been created to cover various structure-, function-, and engineering-related tasks (Runge et al., 2024; Ren et al., 2024). For LLMs working with biomedical textual data, there are also various benchmark datasets focusing on EHRs (Bae et al., 2024; Soni et al., 2023), medical QA (Jin et al., 2021; Pal et al., 2022; Jin et al., 2019), and traditional Chinese medicine QA (Tan et al., 2023; Li et al., 2023a).

A.2.2 Answer Quality Assessment

The Bio-benchmark dataset encompasses diverse tasks with complex answer formats, including objective tasks such as multiple-choice question answering (MCQA) (Robinson et al., 2022; He et al., 2024a), short text matching, mathematical inference, and biological sequence prediction, as well as subjective tasks such as long-text generation and open-ended QA. Consequently, the answers generated by LLMs vary widely in format and structure.

LLM outputs are often difficult to control: different LLMs can produce substantially different outputs even when given the same prompt (Yukun et al., 2024; Gu et al., 2022). Traditional evaluation frameworks attempt to constrain outputs through prompts, but this can degrade generation accuracy and reasoning ability (Röttger et al., 2024). For complex tasks, LLMs' instruction-following ability also decreases, leading to answers that do not adhere to the requested format (Asai et al., 2023).

Traditional evaluation frameworks such as LM Eval Harness and OpenCompass typically use regular expressions (RegEx) or judge models to extract answers (Gao et al., 2024; Contributors, 2023c; Dubois et al., 2024; Zheng et al., 2023), aiming to cover a wide range of output formats. This approach requires substantial effort to design RegEx patterns and struggles to balance false positives against omissions (Zhu et al., 2023; Wang et al., 2023a). Moreover, the generality and accuracy of judge models are markedly inferior to advanced LLMs or human evaluation (Huang et al., 2024), while the cost of applying advanced LLMs or human evaluators to large datasets is prohibitively high. To address these challenges, xFinder proposed a framework that uses LLMs for inference and achieves strong results in answer extraction (Yu et al., 2024). However, its capability and applicability in bioinformatics remain limited: it cannot extract answers from data without reference answers, and it cannot yet assess the quality of long-text responses.
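To make the brittleness of RegEx-based extraction concrete, the sketch below applies a minimal pattern-matching extractor to two MCQA responses that assert the same answer; the patterns and function name are illustrative assumptions rather than part of any existing framework such as BioFinder or xFinder.

```python
import re

# Minimal RegEx-based MCQA answer extractor, written only to illustrate the
# brittleness discussed above; it is NOT the BioFinder or xFinder method.
PATTERNS = [
    r"answer\s*(?:is|:)\s*\(?([A-D])\)?",  # e.g. "The answer is (B)"
    r"^\(?([A-D])\)?[.)]?\s*$",            # a bare option letter on its own line
]

def extract_choice(response: str) -> str | None:
    """Return the first option letter matched by any pattern, else None."""
    for line in response.strip().splitlines():
        for pattern in PATTERNS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                return match.group(1).upper()
    return None

# Both responses commit to option B, but only the first is captured:
print(extract_choice("Let us reason step by step. The answer is (B)."))        # -> B
print(extract_choice("Metformin is first-line here, so option B fits best."))  # -> None
```

An LLM-based extractor reads the full response rather than matching surface patterns, which is why it tolerates free-form phrasing that hand-written RegEx misses.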

A.2.3 Bioinformatics Prompting

Biological Sequences

In bioinformatics, prompting techniques have been applied to both protein and nucleic acid sequences. For DNA sequences, template-based prompts generate natural language explanations of gene interactions, improving, for example, the interpretability of synthetic lethality predictions for drug discovery (Petroni et al., 2019; Liu et al., 2021; Zhang et al., 2024a). Soft prompts integrate DNA sequences into tunable templates (Li et al., 2023b), improving the performance of models such as DNABERT (Ji et al., 2021) in recognizing promoter sequences. For RNA sequences, CoT and repeated prompting have been applied to RNA-seq-based cell type annotation with GPT-3 and GPT-4 (Hou and Ji, 2024). For protein sequences, interaction prompts (Wu et al., 2024a) and continuous prompts (Li and Liang, 2021; Zou et al., 2023) aid in predicting protein structures and interactions by embedding task-specific information directly into the input sequences. In addition, protein CoT prompting has been applied to predicting non-physical protein-protein interactions (Wang et al., 2023b).
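As a concrete illustration of the prompting styles discussed above, the snippet below sketches a few-shot chain-of-thought prompt builder for a sequence-level task such as RNA function prediction; the instruction wording and the example record are hypothetical and are not the exact Bio-benchmark prompts.

```python
# Sketch of a few-shot CoT prompt builder for RNA function prediction.
# The template wording and the example record are illustrative assumptions.
def build_cot_prompt(query_seq: str, examples: list[tuple[str, str, str]]) -> str:
    """examples: (sequence, reasoning, label) triples used as in-context shots."""
    lines = [
        "You are an RNA biology assistant. Classify the function of each RNA sequence.",
        "Think step by step before giving the final label.",
        "",
    ]
    for seq, reasoning, label in examples:
        lines += [f"Sequence: {seq}", f"Reasoning: {reasoning}", f"Function: {label}", ""]
    lines += [f"Sequence: {query_seq}", "Reasoning:"]
    return "\n".join(lines)

few_shot_examples = [
    ("GGCUAGUACCACCUGAAGAGG...",  # truncated, hypothetical sequence
     "Cloverleaf-like pairing and an anticodon-loop signature suggest a tRNA.",
     "tRNA"),
]
print(build_cot_prompt("GAUCGGUAGCUAGCUAGGCUA...", few_shot_examples))
```

A 0-shot variant simply omits the in-context examples, leaving only the instruction and the query sequence.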

Drugs

Prompting methods are also used in drug-related tasks. For drug-target binding affinity prediction, dynamic prompt generation captures the unique interactions between drugs and their targets by integrating context-specific prompts with molecular features (Xiao et al., 2024). Latent-vector prompts enhance molecular design by incorporating continuous prompts into transformer models through cross-attention mechanisms (Kong et al., 2024). In-context learning strategies, such as those used for predicting synergistic drug combinations, leverage masking and graph representations to improve personalized drug synergy predictions (Edwards et al., 2023).

Biological Textual Data

Prompting techniques also extend to biological textual data, including EHRs and biomedical QA systems. In EHRs, explanatory prompts streamline clinical documentation by contextualizing each section, reducing clinicians' workload (He et al., 2024b). Prompt engineering methods, including paraphrasing and persona-based instructions, enhance the effectiveness of LLM embeddings for medical diagnostics and prognostics (Gao et al., 2024; Chen et al., 2023). In biomedical QA systems, strategies such as CoT (Chen et al., 2024; Wu et al., 2024b) and Graph of Thoughts (GoT) (Hamed et al., 2024) help improve reasoning ability and reduce misinformation in responses. Specialized prompts for TCM QA systems incorporate domain-specific knowledge and examples, optimizing model performance in these specialized medical contexts (Yizhen et al., 2024).

Appendix B Sentence Embedding Similarity Analysis

We evaluated the cosine similarities among various sentences embedded using the GTE-large-en-v1.5 model. Specifically, SENTENCE1-CONCLUSION represents the LLM's response, ANSWER denotes the reference answer, and INVERSE ANSWER is the negation of the reference answer. As illustrated in Figure 6, when the reference answer is negated, the similarity between the reference answer and its inverse is second only to the similarity between the CONCLUSION and the reference answer. This observation indicates that the embedding model does not effectively capture the logical and semantic relationships between sentences.

[Figure 6: cosine similarities among CONCLUSION, ANSWER, and INVERSE ANSWER embeddings]
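A minimal sketch of this similarity check, assuming the GTE-large-en-v1.5 checkpoint is loaded through the sentence-transformers library; the Hugging Face model id and the three example sentences below are assumptions for illustration, and only the comparison logic mirrors the analysis above.

```python
# Sketch of the embedding-similarity check described above. The model id and
# the example sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

conclusion = "The compound inhibits the target kinase."              # LLM response (CONCLUSION)
answer = "The compound inhibits the target kinase."                  # reference ANSWER
inverse_answer = "The compound does not inhibit the target kinase."  # negated reference (INVERSE ANSWER)

embeddings = model.encode([conclusion, answer, inverse_answer], normalize_embeddings=True)

print("CONCLUSION vs ANSWER:    ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("ANSWER vs INVERSE ANSWER:", util.cos_sim(embeddings[1], embeddings[2]).item())
# The negated pair typically still scores close to 1, which is why raw
# embedding similarity cannot be used to verify the logic of an answer.
```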

Appendix C Overall evaluation metric values

[Figure 7: overall evaluation metric values]

C.1 Bioinformatics Sequence

Appendix D Prompt Templates Used in Inference

[Figures 8–38: prompt templates used in inference]