Original Article

International Classification of Primary Care (ICPC-2) and search engines: an exploration of three algorithms for information retrieval to aid medical coding

Vinicius Anjos de Almeida1, Egbert J. van der Haring2, Kees van Boven3, Luis Fernandez Lopez1

1Departamento de Medicina Legal, Faculdade de Medicina da Universidade de São Paulo, São Paulo, SP, Brazil; 2Independent Researcher, formerly at eggbird, Leeuwarden, The Netherlands; 3Department of Primary and Community Care, Radboud Institute of Medical Innovation, Radboud University Medical Centre, Nijmegen, The Netherlands

Contributions: (I) Conception and design: VA de Almeida; (II) Administrative support: VA de Almeida, LF Lopez; (III) Provision of study materials or patients: VA de Almeida; (IV) Collection and assembly of data: VA de Almeida; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Vinicius Anjos de Almeida, MD. Departamento de Medicina Legal, Faculdade de Medicina da Universidade de São Paulo, Av. Dr. Arnaldo, 455 - Cerqueira César, 01246-903 São Paulo, SP, Brazil. Email: vinicius.almeida@alumni.usp.br.

Background: Medical coding is an essential process to collect structured data in healthcare. The International Classification of Primary Care, 2nd edition (ICPC-2), is more concise and appropriate for primary care compared to other classifications. However, healthcare professionals still struggle to correctly attribute codes in various clinical scenarios. Modern tools are necessary to help efficiently and accurately find the right codes. This study’s goal was to evaluate and compare three different information retrieval algorithms for retrieving ICPC-2 codes.

Methods: Three different strategies for information retrieval were compared: BM25, Levenshtein distance, and semantic search with embeddings from Large Language Models (LLMs) from different providers. As embedding models, we included models from OpenAI (text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large), Cohere (embed-multilingual-v2.0 and embed-multilingual-v3.0 with the subtypes search_document, search_query, classification, and clustering), and Gemini (embedding-001 with the subtypes semantic_similarity, retrieval_query, retrieval_document, classification, and clustering). An official thesaurus for ICPC-2 codes in Brazilian Portuguese was used to develop a search engine, which was made publicly available and shared through social media with primary care professionals for data collection. A total of 11,868 queries were collected, of which 7,671 (64.6%) were unique. A random sample of 437 unique expressions (5.7% of the unique queries) was annotated with ICPC-2 codes through peer review, which involved selecting the relevant ICPC-2 codes for each query. After this process, 398 entries (5.2% of the unique queries) were included in the evaluation dataset. Precision at k (P@k) and average precision at k (AP@k) were used as evaluation metrics, computed for each query and averaged over all queries. The analysis was conducted on samples both weighted and non-weighted by query frequency. One-way analysis of variance (ANOVA) and Tukey’s tests were used for hypothesis testing.

Results: The evaluation dataset with 398 queries encoded with ICPC-2 was used to evaluate the results of the three different information retrieval algorithms and of the different models. Semantic search with embeddings from LLMs outperformed BM25 and Levenshtein distance in all assessed metrics: P@1, P@5, P@10, AP@1, AP@5, AP@10 (P value <0.001). When comparing different embedding models, the OpenAI model performed better than the others in most metrics. The OpenAI model text-embedding-3-large showed at least one relevant result in the top 10 in 85.7% of the queries in the non-weighted sample and in 81.4% of the queries in the weighted sample. The BM25 algorithm combined with query preprocessing performed similarly to the semantic search in the AP@5 and AP@10 metrics.

Conclusions: Semantic search with embeddings from LLMs seems to perform better than BM25 and Levenshtein distance for retrieving ICPC-2 codes in the Brazilian Portuguese language. It is a promising approach for aiding healthcare professionals in medical coding. The BM25 algorithm combined with query preprocessing is an interesting option that can perform similarly to semantic search in some metrics, although it has some additional limitations.

Keywords: International Classification of Primary Care (ICPC); electronic health records (EHRs); ranking algorithms; natural language processing


Received: 20 September 2024; Accepted: 07 February 2025; Published online: 14 March 2025.

doi: 10.21037/jmai-24-341


Highlight box

Key findings

• Semantic search using Large Language Model (LLM) embeddings outperformed BM25 and Levenshtein distance algorithms in all metrics for retrieving International Classification of Primary Care, 2nd edition (ICPC-2) codes in Brazilian Portuguese.

• The BM25 algorithm, when combined with preprocessing, performed similarly to semantic search in specific metrics such as AP@5 and AP@10.

• OpenAI’s text-embedding-3-large model was the best performing among the tested LLM models in most metrics.

What is known and what is new?

• ICPC-2 is widely used in primary care, but healthcare professionals face difficulties in finding appropriate codes using existing search engines.

• This manuscript demonstrates that modern semantic search techniques, particularly those leveraging LLM embeddings, offer a more efficient and accurate solution for medical coding.

What is the implication, and what should change now?

• Improving search engines in electronic health records by incorporating semantic search methods could significantly enhance medical coding accuracy and reduce the cognitive load on healthcare professionals.

• Future developments should focus on testing these methods in other languages and expanding the dataset to include more real-world queries.


Introduction

Background

Medical coding is an essential procedure to collect structured data in healthcare delivery. It can assist healthcare professionals in better understanding their patients’ needs. Healthcare institutions and systems can use it to monitor the health of populations and to make data-driven decisions based on real-world data. Research teams can also use it in several ways, including estimating the incidence and prevalence of diseases (1), pre- and post-test probabilities to aid diagnosis (2), understanding the reasons why people seek medical assistance (3), and comparing data internationally (4).

Rationale and knowledge gap

The International Classification of Primary Care (ICPC) is widely used in primary care settings. Initially, it was developed to code reasons for healthcare encounters, since the International Classification of Diseases (ICD) was designed for morbidity and mortality statistics. Also, the applicability of the ICD in primary care settings was demonstrated to be poor, due to low agreement rates between different coders (5). Even in a simulated setting, only 56% of the codes selected by the professionals were considered appropriate, one-fourth of the conditions were omitted, and that rate increased to 38% in secondary diagnoses (6). The ICPC can be applied to a wide range of prevalent conditions in primary care while maintaining conciseness. It includes rubrics for undifferentiated symptoms, procedures, and non-disease-related conditions, such as social context, risk factors, and codes related to healthy individuals. ICPC second edition is available in 19 languages, was validated in several countries, and is widely used by primary care providers across Europe and Australia (7). It is as applicable in high-income countries as in middle and low-income ones. These characteristics make it ideal for daily use in primary care settings worldwide (3).

The agreement rate between different coders using ICPC-2 is considered good with the aid of the ICPC-2/ICD-10 thesaurus. Among experts, it can reach 97.9% at the chapter level and 95.6% at the rubric level. Among professionals without previous training, it can reach 74.5% (8). These agreement rates were achieved with all the coding being done separately from all the other tasks healthcare professionals are responsible for. In practice, medical coding is done in a time-constrained environment while managing more complex tasks, such as making a diagnosis or prescribing treatments. To the best of our knowledge, there is no data available regarding the quality of medical coding when it is done in the daily work context. We assume that, without specific interventions, including constant monitoring, training, and feedback to professionals, coding quality decreases, especially in busy clinical settings.

Nowadays, electronic health records (EHRs) are widespread, and ICPC-2 is generally available through a search engine in the same system. Due to all their other responsibilities, healthcare professionals cannot dedicate much attention to medical coding. Only 10% of professionals attempt different query strategies, and many report difficulty in finding the best code (6). In practice, those search engines generally return poor-quality results, increase the time spent on medical coding, and do not help professionals choose the most adequate codes.

Objective

The goal of this study is to evaluate three different information retrieval algorithms that can be used in EHR search engines to find the best ICPC-2 code for a given query. The selected algorithms included BM25 (9) and Levenshtein distance (10), which are still widely used and extended (11,12), and embedding models that recently emerged from Large Language Models (LLMs), adding new capabilities to information retrieval systems (13). We expect that the results of this study may help improve the quality of medical coding in EHRs and all the activities that depend on it.


Methods

ICPC-2 thesaurus

The ICPC-2/ICD-10 thesaurus (14) was developed through a collaboration between a group of Belgian researchers on behalf of the Ministry of Health of Belgium and researchers from the Department of Family Medicine of the University of Amsterdam. It was released with the book “International Classification of Primary Care, Revised Second Edition (ICPC-2-R)”, published in 2005 by Oxford University Press (7). The Brazilian Portuguese translation used in this study was published in 2010 by the Brazilian Society of Family and Community Medicine (SBMFC) in partnership with the Ministry of Health of Brazil (15).

The thesaurus provides a mapping between clinical concepts and ICPC-2/ICD-10 codes. It was used in this study to provide the data needed to implement the different information retrieval algorithms. Using the relationships defined in the thesaurus, a corpus with 73,563 entries was constructed as the foundation for all the search algorithms evaluated; it is available at https://cdn.amegroups.cn/static/public/jmai-24-341-1.xlsx. The original thesaurus is available at https://cdn.amegroups.cn/static/public/jmai-24-341-2.xlsx. Each entry in the corpus was defined by an ICPC-2 code, an expression, and a relation type that describes where the expression came from. The relation types are:

  • ciap_to_text: maps Brazilian Portuguese expressions to their corresponding ICPC-2 code. This accounts for 48,517 entries (66.0% of the corpus).
  • ciap_to_ciap: maps the ICPC-2 code, written as text, to its corresponding ICPC-2 code. This allows the user to find the right ICPC-2 code by entering the code directly in the search engine. This accounts for 726 entries (1.0% of the corpus).
  • ciap_to_ciap_title: maps the title of the ICPC-2 code to its corresponding ICPC-2 code. This allows the user to find the right ICPC-2 code by entering the code’s title. This accounts for 726 entries (1.0% of the corpus).
  • ciap_to_cid: maps the most frequent ICD-10 code column of the thesaurus to its corresponding ICPC-2 code, which allows the user to find the right ICPC-2 code by entering an ICD-10 code. This accounts for 18,000 entries (24.5% of the corpus).
  • ciap_to_other_cids: maps the alternative ICD-10 code columns of the thesaurus to their corresponding ICPC-2 code, which also contributes to finding ICPC-2 codes from ICD-10 codes. This accounts for 5,594 entries (7.6% of the corpus).

This corpus serves as the backbone for evaluating and benchmarking the search algorithms presented in this study. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The study was approved by the institutional ethics committee of Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) (No. 70023923.3.0000.0068), and informed consent was not necessary from the users who interacted with the search engine, as no personally identifiable information was collected.

Information retrieval algorithms

Information retrieval algorithms aim at finding the most relevant documents in a dataset for a given query. Three different algorithms were evaluated in this study: BM25, Levenshtein distance, and semantic search with embeddings from LLMs.

BM25

The BM25 algorithm was first published in 1994 (9). BM25 remained the standard ranking algorithm for many years, with near human-level performance (16). This method considers term frequency in the query, term frequency in the documents, and document length to attribute relevance scores to each document in the given dataset (9). The BM25 implementation used is available in the Python package “rank-bm25 0.2.2”. This approach was evaluated both with and without query preprocessing. The preprocessing involved two simple steps: lowercasing (transforming every letter in the query to its lowercase form) and special character removal. Examples of special character removal include transforming ‘ç’ into ‘c’, ‘à’ into ‘a’, ‘â’ into ‘a’, ‘ã’ into ‘a’, etc. This was achieved using a Python library called “unidecode”, which transliterates Unicode text, which can contain a vast range of symbols, into American Standard Code for Information Interchange (ASCII) characters, an encoding standard for electronic communication with a very limited set of 128 characters, of which 95 are printable. This kind of preprocessing makes the retrieval algorithm resilient to certain potential typos and upper/lowercasing patterns, increasing the chances of a match: ‘Coração’, ‘coracão’ and ‘coraçao’ are all equivalent after preprocessing, resulting in ‘coracao’.
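For illustration, a minimal sketch of this preprocessing step is shown below, using the same “unidecode” library mentioned above; the function name preprocess is ours, and the study’s actual code may differ in detail.

```python
# A minimal sketch of the query preprocessing described above,
# using the "unidecode" library mentioned in the text.
from unidecode import unidecode

def preprocess(query: str) -> str:
    """Lowercase the query and transliterate non-ASCII characters to ASCII."""
    return unidecode(query.lower())

# The three spellings from the example above collapse to the same form:
assert preprocess("Coração") == preprocess("coracão") == preprocess("coraçao") == "coracao"
```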

Levenshtein distance

Levenshtein edit distance was developed by Vladimir I. Levenshtein and published in his most cited paper in 1966 (10). The algorithm compares two sequences and computes the minimal number of edits (deletions, insertions, and substitutions) needed to convert one sequence into the other. It is still relevant today in many fields, such as bioinformatics (12), and can also be used as an information retrieval algorithm. We used the implementation of this algorithm available in the Python package “polyleven 0.8”. To perform the retrieval, the Levenshtein distance between the entire query string and each entry in the corpus was calculated. The results were then sorted in ascending order, with the smallest distances (most similar) at the top.
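A toy sketch of this retrieval-by-edit-distance step, using the “polyleven” package cited above, is shown below; the corpus here is a stand-in, not the real thesaurus.

```python
# Illustrative sketch: rank corpus entries by Levenshtein distance to the query.
from polyleven import levenshtein

corpus = ["tosse", "dor no peito", "hipertensao sem complicacoes"]  # toy entries

def retrieve(query: str, entries: list[str], k: int = 10) -> list[tuple[str, int]]:
    """Return the k entries closest to the query, smallest distance first."""
    scored = [(entry, levenshtein(query, entry)) for entry in entries]
    return sorted(scored, key=lambda pair: pair[1])[:k]

print(retrieve("tose", corpus))  # 'tosse' (distance 1) ranks first
```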

Semantic search with embeddings from LLMs

LLMs are pre-trained on enormous datasets in an unsupervised manner. In that process, these models learn representations of text data that capture abstract aspects of language, such as meaning and concepts that cut across different expressions and languages. These numerical representations of data are generally called dense vectors or embeddings and can be used in information retrieval tasks. The relevance of documents in this scenario is measured by the semantic similarity between the query and the available documents (17). Cosine similarity was used to calculate the semantic similarity between two vectors. As the retrieval mechanism, the Hierarchical Navigable Small World (HNSW) graphs algorithm was used, a technique introduced by Malkov and Yashunin for efficient approximate nearest neighbor search (18). Chroma DB was used as the embedding database, and the open-source library “chromadb 0.4.22” was used to perform the retrieval.
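A minimal sketch of this setup with the “chromadb” package is shown below; the 3-dimensional vectors are made-up placeholders, whereas in the study the embeddings came from the provider APIs described next.

```python
# Sketch of embedding-based retrieval with ChromaDB: cosine similarity over HNSW.
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="icpc2_corpus",
    metadata={"hnsw:space": "cosine"},  # cosine distance over an HNSW index
)

# Toy entries standing in for the 73,563 thesaurus-derived corpus entries.
collection.add(
    ids=["1", "2"],
    documents=["tosse -> R05", "dor no peito -> A11"],
    embeddings=[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]],
)

# At query time, the query is embedded with the same model and the nearest
# entries are retrieved by approximate nearest neighbor search.
results = collection.query(query_embeddings=[[0.1, 0.8, 0.1]], n_results=2)
print(results["documents"])
```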

Three different LLMs embedding Application Programming Interface (API) services were used: OpenAI, Cohere, and Gemini. OpenAI offers three different embedding models: text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large. Cohere offers several embedding models. In this study, we included only those that had the Portuguese language represented in the model training data. These were the following: embed-multilingual-v2.0 and embed-multilingual-v3.0 including the four different subtypes available (search_document, search_query, classification, and clustering). Gemini’s embedding model is offered by Google as embedding-001 and has subtypes (semantic_similarity, retrieval_query, retrieval_document, classification, clustering). All of them were included in the analysis.
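As a hedged example, the snippet below shows how a query embedding can be obtained from one of these services using the current OpenAI Python client; the client version and call style used in the study may differ, and a valid API key is assumed to be configured.

```python
# Obtaining a dense query vector from the OpenAI embeddings API (illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="dor no peito",
)
query_vector = response.data[0].embedding  # vector later compared by cosine similarity
print(len(query_vector))  # dimensionality of the embedding
```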

Data collection

To evaluate the different information retrieval algorithms, a dataset mapping queries to their relevant documents is needed. For this purpose, a simple search engine was developed to help healthcare professionals find the ICPC-2 codes they needed. The app was made publicly available on the web and shared through social media with primary care providers. No login or registration was needed to use the tool, and no user-specific data was collected. Only data related to the queries was collected, from April 2023 to October 2023. A screenshot of the search engine is shown and described in Figure 1.

Figure 1 Screenshot of the search engine that was publicly available during the study period. For each query, the results were presented below the search input element. Each result showed the expression found to be similar to the query and additional details about that expression retrieved from the thesaurus, such as the ICPC-2 and ICD-10 codes and inclusion and exclusion criteria. This allowed the tool to be as useful as possible for users while contributing to our study purposes. In the hidden sidebar, users were able to select which information retrieval algorithm they preferred. ICPC-2, International Classification of Primary Care, 2nd edition; ICD-10, International Classification of Diseases, 10th Revision; CIAP 2, Classificação Internacional de Atenção Primária-Segunda Edição.

In total, 11,868 queries were collected through the search engine. Of those, 7,671 (64.6%) were unique expressions. Then, a random sample of 437 (5.7% of the unique expressions) was encoded with ICPC-2 through peer review.

Two family physicians with at least 8 years of experience (including residency) with medical coding using ICPC-2 independently mapped the sampled queries to their relevant ICPC-2 codes. In the context of medical coding, relevant codes refer to the list of codes that would be considered appropriate results for a given query. Disagreements between the reviewers were resolved through follow-up meetings until a consensus was reached. The following rules were applied during data labeling:

  • Unclear queries were removed.
  • If more than one interpretation for the query was possible, all codes related to those interpretations were included as relevant.
  • If the query was too unspecific, unspecified codes were preferred and inferences about what the user was looking for were avoided.
  • All official information available about ICPC-2 in Portuguese was considered for code selection (ICPC-2 manual, thesaurus, summary sheet) (15).
  • Non-Portuguese queries were excluded.
  • Typographical errors were maintained.

Some queries included random letters or words without any meaningful context. In these cases, selecting relevant ICPC-2 codes was not possible, and they were removed from the analysis. Examples: ‘preferecia’, ‘as’, ‘obs’. In the end, 39 unclear queries were removed from the sampled queries (3 for being in other languages, 36 for being too unspecific or unintelligible). Each reviewer’s annotations and the final consensus on relevant codes are available at https://cdn.amegroups.cn/static/public/jmai-24-341-3.xlsx and in the code repository on GitHub.

Queries could have one or more interpretations. For a query like “cough”, there is only one relevant ICPC-2 code, R05. Some queries have more than one, such as “chest pain”, which can be interpreted as chest pain attributed to the heart (K01, heart pain), a musculoskeletal condition (L04, chest symptom/complaint), a respiratory condition (R01), or an unspecified symptom (A11). Other queries included a code itself, to retrieve more information about it. In these cases, when the query was an ICPC-2 code or an ICD-10 code, the relevant ICPC-2 codes assigned to that query were the ones associated with the ICPC-2 or ICD-10 code queried. For example, if the query is “I10”, which can be the ICD-10 code for “Essential (primary) hypertension”, the expected ICPC-2 code is K86, hypertension uncomplicated.

Evaluation metrics

Two metrics designed to evaluate ranked lists were used.

Precision at k (P@k)

Precision refers to the ratio of the relevant items retrieved to the total number of items retrieved. The letter k refers to the number of top-ranked items to be considered when calculating the precision. In this metric, what matters is the number of relevant documents in the top k documents retrieved, regardless of the order in which they appear (19).

Average precision at k (AP@k)

This metric extends the concept of P@k to consider the order in which the relevant documents appear. It is calculated as the average of the precision of each relevant document considering the position in which they appear and only considering relevant documents in the top-k positions (19).
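For clarity, a compact sketch of both metrics under the conventions just described follows; the AP@k variant here averages over the relevant documents found within the top k, matching the definition above.

```python
# Illustrative implementations of P@k and AP@k.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """P@k: fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """AP@k: mean of P@i over the ranks i <= k where a relevant item appears."""
    hits, score = 0, 0.0
    for i, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

# Example with the relevant codes for "chest pain" and a hypothetical ranking:
relevant = {"A11", "K01", "R01", "L04"}
retrieved = ["K01", "D01", "A11", "R01", "K02"]
print(precision_at_k(retrieved, relevant, 5))          # 3/5 = 0.6
print(average_precision_at_k(retrieved, relevant, 5))  # (1/1 + 2/3 + 3/4)/3 ≈ 0.81
```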

These metrics were computed for the top 1, 5, and 10 results. The sample was evaluated both as-is and weighted by the frequency distribution of the queries. The time spent on the computation of each algorithm was also recorded and analyzed.

Statistical analysis

Inter-annotator agreement

To report the agreement rate between the two annotators of the evaluation dataset, Cohen’s kappa statistic (20) was adapted to multi-label annotation in which the number of labels per item varies. This adaptation was needed because Cohen’s kappa was originally designed to estimate inter-rater agreement between two reviewers assigning a single label per item. Additionally, it was important to account for partial agreements in our case: the two annotators can agree on adding two codes as relevant but disagree about adding a third. The kappa statistic has been adapted previously for multi-label tasks (21), but these adaptations did not account for a variable number of labels per item or for items with no labels at all.

To account for partial agreements, we used the Jaccard index (22), which is well-suited for our task, as it measures the similarity between two sets with variable lengths. It is defined by:

J(A,B) = \frac{|A \cap B|}{|A \cup B|}

where A and B are the sets of codes assigned by the reviewers to a given query. The Jaccard index results in a value between 0 and 1 and can be interpreted as a measure of overlap between two sets, resulting in 0 for no overlap and 1 for total overlap. For queries where both reviewers assigned no codes, the Jaccard index was set to 1.0, reflecting perfect agreement. Kappa’s observed agreement [p(O)] was computed as the average Jaccard index across all encoded queries.

Kappa’s expected agreement [p(E)] was calculated based on each code’s frequency and the probability of the two reviewers independently selecting the same codes by chance. Code frequency was defined as the proportion of queries containing each code assigned by each reviewer:

P_{code,\,reviewer} = \frac{\text{Number of queries containing the code}}{\text{Total number of queries}}

The expected agreement was then computed as the sum of the probabilities of each code being independently assigned to a query by both reviewers:

p(E) = \sum_{i=1}^{N} P_{code_i,\,reviewer_1} \cdot P_{code_i,\,reviewer_2}

Finally, the kappa statistic (κ) was calculated using:

\kappa = \frac{p(O) - p(E)}{1 - p(E)}

where p(O) is the observed agreement and p(E) is the expected agreement. According to McHugh (23), the agreement rate measured by the kappa statistic can be interpreted as none (0.00–0.20), minimal (0.21–0.39), weak (0.40–0.59), moderate (0.60–0.79), strong (0.80–0.90), or almost perfect (>0.90).
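For illustration, a sketch of this adapted kappa under the conventions above is shown below; the function names are ours, not the study’s.

```python
# Adapted kappa: Jaccard-based observed agreement, frequency-based chance agreement.
from collections import Counter

def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0  # both reviewers assigned no codes: perfect agreement
    return len(a & b) / len(a | b)

def adapted_kappa(labels_1: list[set[str]], labels_2: list[set[str]]) -> float:
    n = len(labels_1)
    # Observed agreement p(O): average Jaccard index across all queries.
    p_o = sum(jaccard(a, b) for a, b in zip(labels_1, labels_2)) / n
    # Per-reviewer code frequencies: proportion of queries containing each code.
    freq_1 = Counter(code for codes in labels_1 for code in codes)
    freq_2 = Counter(code for codes in labels_2 for code in codes)
    # Expected agreement p(E): chance of both reviewers assigning the same code.
    p_e = sum((freq_1[c] / n) * (freq_2[c] / n) for c in set(freq_1) | set(freq_2))
    return (p_o - p_e) / (1 - p_e)
```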

Hypothesis testing

For initial hypothesis testing, considering the sample size, the P@k and AP@k metrics from each model were submitted to the parametric one-way analysis of variance (ANOVA) test (24) to determine whether there was a significant difference between the models evaluated in each metric. Post-hoc analysis was conducted with Tukey’s test (25) for pairwise comparisons between models to assess which models performed significantly better than others in each metric. The alpha level was set to 0.05 to define statistical significance. The smallest difference considered to be important was arbitrarily set to 0.05 in either P@k or AP@k.
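A sketch of this testing pipeline using SciPy and statsmodels follows; the per-query score arrays are random stand-ins for the real metric values, and the study’s exact code may differ.

```python
# One-way ANOVA across models, followed by Tukey's HSD post hoc test.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Hypothetical per-query P@10 scores for three of the sixteen algorithms/models.
scores = {
    "bm25": rng.uniform(0, 1, 398),
    "levenshtein": rng.uniform(0, 1, 398),
    "openai_3_large": rng.uniform(0, 1, 398),
}

# ANOVA: is there any significant difference among the groups?
f_stat, p_value = stats.f_oneway(*scores.values())
print(f_stat, p_value)

# Tukey's HSD: which pairwise differences are significant at alpha = 0.05?
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), 398)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```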

To compare the rate at which each model retrieved a relevant result in the top 10 results, the Chi-squared test was used (26), since this is a comparison of two categorical variables (model type and has/has not a relevant result in the top 10). The number of degrees of freedom (df) is determined by the following formula:

df = (\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)

In our case, with 16 algorithms/models being tested and two columns (has/has not a relevant result in the top 10), df is 15.
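As an illustration, the snippet below runs the Chi-squared test on a small contingency table, with counts approximated from the rates in Table 1 for three of the sixteen models.

```python
# Chi-squared test of independence on a (models x has/has-not) contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: models; columns: [has relevant result in top 10, has not] of 398 queries.
table = np.array([
    [214, 184],  # BM25 without preprocessing (~53.8%)
    [297, 101],  # BM25 with preprocessing (~74.6%)
    [341, 57],   # text-embedding-3-large (~85.7%)
])

chi2, p_value, df, expected = chi2_contingency(table)
print(chi2, p_value, df)  # df = (3 - 1) * (2 - 1) = 2 for this toy table
```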

Hardware

This study was conducted on a desktop equipped with a 13th Gen Intel(R) Core(TM) i9-13900K (3.00 GHz, 24 cores), 64 GB DDR5 RAM, and a NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM. The system had a 4 TB Kingston SSD for storage and Windows 11 Pro as the operating system.

Source code

All study data and source code were made available through a GitHub repository to make it easier to reproduce the results. It is available at: https://github.com/almeidava93/icpc_ir_paper.


Results

From April 2023 to October 2023, 11,868 queries were collected through the search engine. Of those, 7,671 (64.6%) were unique expressions. Then, a random sample of 437 (5.7% of the unique expressions) was evaluated. Two reviewers experienced with ICPC-2 in daily practice independently selected a list of codes for each query. After the annotation process, they discussed all the disagreements until a consensus was reached. The agreement rate between the two reviewers measured with the kappa statistic was 0.64, which can be considered moderate according to McHugh (23). Following the consensus, 39 unclear queries were removed from the sampled queries (3 for being written in other languages, 36 for being too unspecific or unintelligible). Finally, 398 annotated queries (5.2% of the unique expressions) were included in the evaluation dataset.

The queries in the evaluation dataset were used to perform a retrieval with each of the algorithms. The retrieved results from each algorithm and model are available at https://cdn.amegroups.cn/static/public/jmai-24-341-4.xlsx. P@k and AP@k were computed and aggregated with a simple mean (Table 1) and with a weighted mean (Table 2). These tables also include the rate at which at least one relevant document was retrieved in the top 10 results. Both weighted and non-weighted samples were submitted to the one-way ANOVA test to determine whether the observed differences are statistically significant.

Table 1

Information retrieval results aggregated with simple mean

Algorithm/embedding model P@1* P@5* P@10* AP@1* AP@5* AP@10* %*
BM25
   Without preprocessing 0.348 0.240 0.194 0.347 0.388 0.384 53.8
   With preprocessing 0.573 0.350 0.268 0.573 0.596 0.581 74.6
Levenshtein distance 0.412 0.186 0.125 0.412 0.436 0.420 56.5
OpenAI embeddings
   Text-embedding-ada-002 0.613 0.430 0.351 0.613 0.633 0.600 78.9
   Text-embedding-3-small 0.616 0.428 0.357 0.616 0.639 0.597 78.4
   Text-embedding-3-large 0.709 0.530 0.449 0.709 0.727 0.682 85.7
Cohere embeddings
   Embed-multilingual-v2.0 0.598 0.404 0.320 0.598 0.611 0.564 76.1
   Embed-multilingual-v3.0 search_document 0.573 0.314 0.235 0.573 0.593 0.555 72.1
   Embed-multilingual-v3.0 search_query 0.646 0.430 0.340 0.646 0.663 0.622 78.9
   Embed-multilingual-v3.0 classification 0.643 0.407 0.323 0.643 0.658 0.614 78.9
   Embed-multilingual-v3.0 clustering 0.641 0.423 0.334 0.641 0.664 0.623 78.9
Gemini embeddings
   Embedding-001 semantic_similarity 0.548 0.334 0.278 0.548 0.576 0.539 72.9
   Embedding-001 retrieval_query 0.550 0.337 0.277 0.550 0.575 0.538 72.9
   Embedding-001 retrieval_document 0.530 0.317 0.248 0.530 0.550 0.519 68.8
   Embedding-001 classification 0.553 0.332 0.268 0.553 0.576 0.542 72.4
   Embedding-001 clustering 0.548 0.336 0.268 0.548 0.568 0.537 71.4

The % column refers to the frequency with which at least one relevant document was retrieved in the top 10 results. An asterisk (*) was used to mark metrics with a P value <0.05 in the hypothesis test. P@, precision at; AP@, average precision at.

Table 2

Information retrieval results aggregated with weighted mean

Algorithm/embedding model P@1* P@5* P@10* AP@1* AP@5* AP@10* %*
BM25
   Without preprocessing 0.404 0.252 0.189 0.404 0.436 0.432 56.3
   With preprocessing 0.603 0.369 0.266 0.603 0.623 0.612 75.0
Levenshtein distance 0.411 0.166 0.114 0.411 0.427 0.408 52.1
OpenAI embeddings
   Text-embedding-ada-002 0.613 0.375 0.301 0.613 0.607 0.590 81.7
   Text-embedding-3-small 0.583 0.373 0.294 0.583 0.579 0.541 74.2
   Text-embedding-3-large 0.676 0.465 0.380 0.676 0.703 0.658 81.4
Cohere embeddings
   Embed-multilingual-v2.0 0.530 0.363 0.281 0.530 0.533 0.489 66.3
   Embed-multilingual-v3.0 search_document 0.589 0.300 0.209 0.589 0.601 0.562 70.0
   Embed-multilingual-v3.0 search_query 0.679 0.417 0.320 0.679 0.667 0.627 76.1
   Embed-multilingual-v3.0 classification 0.647 0.366 0.288 0.647 0.635 0.592 73.4
   Embed-multilingual-v3.0 clustering 0.674 0.394 0.303 0.674 0.670 0.630 75.9
Gemini embeddings
   Embedding-001 semantic_similarity 0.572 0.346 0.263 0.572 0.570 0.534 71.1
   Embedding-001 retrieval_query 0.558 0.337 0.260 0.558 0.553 0.514 69.1
   Embedding-001 retrieval_document 0.520 0.285 0.223 0.520 0.545 0.495 64.6
   Embedding-001 classification 0.542 0.307 0.241 0.542 0.546 0.515 67.6
   Embedding-001 clustering 0.528 0.321 0.238 0.528 0.541 0.515 66.3

The % column refers to the frequency with which at least one relevant document was retrieved in the top 10 results. An asterisk (*) was used to mark metrics with a P value <0.05 in the hypothesis test. P@, precision at; AP@, average precision at.

One-way ANOVA was used to detect differences between the algorithms/models in each metric. It returns a P value indicating the presence of a significant difference between algorithms, but not which specific differences are significant. To unveil that information, a post hoc analysis (in our case, Tukey’s test) allows pairwise comparisons: for each metric (P@1, P@5, P@10, AP@1, AP@5, AP@10), every possible pair of the 16 algorithms/models is compared, returning a P value that states whether that comparison is significant. This process was done both for the weighted sample and for the non-weighted sample. The frequency of each query is available at https://cdn.amegroups.cn/static/public/jmai-24-341-5.xlsx and was used to compute the weighted-sample metrics. In total, this analysis consists of 12 tables with 120 rows each, each row with a P value. To keep conciseness, the one-way ANOVA test results are reported in Tables 3,4, the post hoc analyses are summarized below, and all tables are available as worksheets at https://cdn.amegroups.cn/static/public/jmai-24-341-6.xlsx (non-weighted sample) and https://cdn.amegroups.cn/static/public/jmai-24-341-7.xlsx (weighted sample).

Table 3

Results of the ANOVA one-way test in the non-weighted sample

Metric F-statistic P value η2
P@1 13.13 <0.001 0.03
P@5 25.28 <0.001 0.06
P@10 28.24 <0.001 0.06
AP@1 13.13 <0.001 0.03
AP@5 15.11 <0.001 0.03
AP@10 13.79 <0.001 0.03

η2, eta-squared effect size measure. ANOVA, analysis of variance; P@, precision at; AP@, average precision at.

Table 4

Results of the ANOVA one-way test in the weighted sample

Metric F-statistic P value η2
P@1 40.60 <0.001 0.03
P@5 68.99 <0.001 0.04
P@10 75.15 <0.001 0.05
AP@1 40.60 <0.001 0.03
AP@5 43.17 <0.001 0.03
AP@10 42.80 <0.001 0.03

η2, eta-squared effect size measure. ANOVA, analysis of variance; P@, precision at; AP@, average precision at.

The rate of at least one relevant document retrieved was analyzed with the Chi-squared test to determine if the differences in the proportions are statistically significant. The Chi-squared test results are presented in Table 5. The time spent in the retrieval for each algorithm and query was also recorded and is summarized in Table 6.

Table 5

Results of the Chi-squared test comparing the rate of retrieval of relevant results in the top 10 results across the information retrieval algorithms tested

Sample Chi-squared statistic P value Degrees of freedom
Not weighted 203.4 <0.001 15
Weighted 651.3 <0.001 15

Table 6

Mean and standard deviation of retrieval time for each model tested

Algorithm/embedding model Mean (seconds) Standard deviation (seconds)
BM25
   Without preprocessing 0.014 0.015
   With preprocessing 0.014 0.015
Levenshtein distance 0.022 0.003
OpenAI embeddings
   Text-embedding-ada-002 0.302 0.368
   Text-embedding-3-small 0.258 0.042
   Text-embedding-3-large 0.353 0.071
Cohere embeddings
   Embed-multilingual-v2.0 0.469 0.151
   Embed-multilingual-v3.0 search_document 0.418 0.392
   Embed-multilingual-v3.0 search_query 0.424 0.244
   Embed-multilingual-v3.0 classification 0.429 0.223
   Embed-multilingual-v3.0 clustering 0.488 0.264
Gemini embeddings
   Embedding-001 semantic_similarity 0.301 0.049
   Embedding-001 retrieval_query 0.297 0.026
   Embedding-001 retrieval_document 0.296 0.025
   Embedding-001 classification 0.298 0.017
   Embedding-001 clustering 0.293 0.013

Post-hoc analysis was performed with Tukey’s test for pairwise comparisons between the models in each metric to determine how the algorithms compare with one another. All the results are available at https://cdn.amegroups.cn/static/public/jmai-24-341-6.xlsx (non-weighted sample) and https://cdn.amegroups.cn/static/public/jmai-24-341-7.xlsx (weighted sample); the relevant aspects are described here.

Post-hoc analysis in the unweighted sample

The OpenAI embedding model text-embedding-3-large had the best performance in all metrics, but that difference was not statistically significant in all comparisons.

  • P@1 and AP@1: the model text-embedding-3-large result revealed a P<0.05 when compared with the BM25 algorithm (with and without preprocessing), the Levenshtein distance algorithm, the Cohere embedding model embed-multilingual-v3.0 search_document, and all Gemini embedding models.
  • P@5: the model text-embedding-3-large result revealed a P<0.05 in all comparisons.
  • P@10: the model text-embedding-3-large result revealed a P<0.05 in all comparisons.
  • AP@5: the model text-embedding-3-large result revealed a P<0.05 when compared with the BM25 algorithm (with and without preprocessing), the Cohere embedding models embed-multilingual-v2.0 and embed-multilingual-v3.0 search_document, and all the Gemini models.
  • AP@10: the model text-embedding-3-large result revealed a P<0.05 when compared with the BM25 algorithm (with and without preprocessing), the Levenshtein distance algorithm, the Cohere embedding model embed-multilingual-v2.0 and embed-multilingual-v3.0 search_document, and all the Gemini models.

Post-hoc analysis in the weighted sample

The OpenAI embedding model text-embedding-3-large had the best performance in P@5, P@10, AP@5, and AP@10, but that difference was not statistically significant in all comparisons. The Cohere embedding model embed-multilingual-v3.0 clustering had the best performance in P@1 and AP@1, but also the differences were not statistically significant in all comparisons.

  • P@1 and AP@1: the Cohere model embed-multilingual-v3.0 clustering result revealed a P<0.05 when compared with the BM25 algorithm (with and without preprocessing), the Levenshtein distance algorithm, the OpenAI model text-embedding-3-small, the Cohere models embed-multilingual-v2.0 and embed-multilingual-v3.0 search_document, and all the Gemini models.
  • P@5: the model text-embedding-3-large result revealed a P<0.05 in all comparisons.
  • P@10: the model text-embedding-3-large result revealed a P<0.05 in all comparisons.
  • AP@5: the model text-embedding-3-large result revealed a P<0.05 when compared with the BM25 algorithm (with and without preprocessing), the Levenshtein distance algorithm, the OpenAI models text-embedding-ada-002 and text-embedding-3-small, the Cohere models embed-multilingual-v2.0, embed-multilingual-v3.0 search_document, and embed-multilingual-v3.0 classification, and all the Gemini models.
  • AP@10: the model text-embedding-3-large result revealed a P<0.05 when compared with the BM25 algorithm without preprocessing, the Levenshtein distance algorithm, the OpenAI models text-embedding-ada-002 and text-embedding-3-small, the Cohere models embed-multilingual-v2.0, embed-multilingual-v3.0 search_document, and embed-multilingual-v3.0 classification, and all the Gemini models.

Discussion

Key findings

This study investigated the performance of three different methodologies for information retrieval to aid healthcare professionals in finding adequate ICPC-2 codes.

The results show that semantic search using LLM embeddings performs better in most of the metrics analyzed than BM25 and Levenshtein distance. This finding reinforces that LLM embeddings capture more abstract representations of meaning that do not depend on the exact words used, which also makes them more resilient to typos and different writing styles.

The BM25 algorithm with preprocessing performed similarly to the best LLM model in some metrics (AP@5 and AP@10 in the weighted and non-weighted samples). This shows that systems can achieve substantial improvements in information retrieval performance just by combining the BM25 algorithm with simple preprocessing techniques, such as lowercasing and special character removal. This finding refers to Brazilian Portuguese and needs testing in other languages; languages that commonly use special characters may benefit more from this approach. Other advantages are that the BM25 algorithm is open-source and 30 to 40 times faster to compute than embedding models from third-party services. The main limitation of this approach is its inability to deal with typos and synonyms.

Of all the LLM embeddings tested, text-embedding-3-large from OpenAI performed best in all metrics when considering simple mean aggregation of the results. When considering the frequency of each query in the search engine, Cohere’s embedding model reaches the best result in P@1 and AP@1, while OpenAI’s text-embedding-3-large achieves the best results in the other metrics. OpenAI embedding models were also the most performant when considering the rate of having at least one relevant document in the top 10 results. On this metric, in the unweighted sample, text-embedding-3-large performed best and retrieved at least one relevant result in the top 10 in 85.7% of the queries. In the weighted sample, text-embedding-ada-002 was the best and retrieved at least one relevant result in 81.7% of the queries.

The average time to perform the retrieval increases approximately 30- to 40-fold when comparing BM25 or Levenshtein distance with LLM embeddings. Although the processing time increases, performance also increases by 96% when comparing BM25 P@1 without preprocessing with the OpenAI model text-embedding-3-large. The mean retrieval time among LLM embeddings was 0.356 seconds, which can still offer a good user experience in an EHR while decreasing the need for multiple queries.

Strengths and limitations

This study has several strengths. The official thesaurus in Brazilian Portuguese was used. This guarantees that the mapping between codes and expressions has been thoroughly revised. It was created by the WONCA International Classification Committee, which is responsible for the creation and development of the ICPC, in partnership with the SBMFC (15). The sample of queries that were used to evaluate the algorithms was made of real-world data generated by healthcare providers while using the search engine. It reflects the most common doubts that providers have while looking for ICPC-2 codes and shows the importance of having a powerful search engine since 64.6% of the total number of queries were new and unique expressions.

The evaluation dataset was annotated by two family physicians with at least 8 years (including residency) of experience with ICPC-2. They achieved a moderate agreement rate (κ=0.64) in independent annotation and resolved all disagreements in subsequent sessions. To the best of our knowledge, this is the first study to compare different ranking algorithms in the context of ICPC-2 coding and to investigate which approach is better for facilitating medical coding. Considering all methods studied, the average retrieval times were below 500 ms, which is generally acceptable for real-time applications and is a strength of the algorithms evaluated. All the code and data were shared online to allow easy reproduction of our results. Finally, this is the first peer-reviewed dataset of ICPC-2 queries in Brazilian Portuguese made available for research purposes.

As a limitation, the labeled dataset used for the evaluation of the information retrieval algorithms was small, since the data annotation and peer review processes are very time-consuming. However, the already collected data provides a basis for expansion in future studies. Additionally, only Brazilian Portuguese was studied, and since no user-specific information was collected, the expressions gathered by the search engine may not be representative of the regional variations of Portuguese across Brazil. The performance of each LLM’s embeddings depends on many aspects of their pre-training, including how well each language was represented in the training data; therefore, the results presented here may not apply to different languages. Finally, the hardware used in the analysis was powerful compared to what is commonly available in general practice settings. Therefore, the computation time measured may vary depending on the hardware available, the internet connection while interacting with online services, and the availability of third-party services.

Comparison with similar research

To the best of our knowledge, there is no study in the literature focused on comparing different information retrieval strategies for finding ICPC-2 codes.

One study focused on ICD-10 code retrieval (27) has some similarities to this research. A list of expressions encoded by specialists was available to be used as an evaluation dataset. That dataset of 6,787 entries had mappings between expressions and ICD-10 codes and was reviewed by terminologists. No such dataset with ICPC-2 mappings in Brazilian Portuguese is publicly available for research to serve as an evaluation dataset for search engines. For this reason, our search engine was made publicly available to collect real-world queries, which were then labeled for the evaluation phase. Another similarity is that a rich terminology dataset with mappings between expressions and ICD-10 codes was used as the corpus for the search engine. In our case, we used the official Brazilian Portuguese ICPC-2 thesaurus (15).

There are, however, several differences worth mentioning. The main purpose of that study was to compare the performance of the retrieval applied to different corpora. The authors developed different versions of the search engine with the same algorithms, but based on different datasets of mappings between expressions and ICD-10 codes. They also used an expansion of an already existing corpus that included term frequencies, stop words, synonyms, collocations, abbreviations, and frequent typographical errors, which was not available for ICPC-2 in Brazilian Portuguese. Our main purpose was to compare different information retrieval algorithms, not different corpora.

Another key difference was the evaluation metric used. They used recall@k, which measures the proportion of relevant documents retrieved out of all possible relevant documents (28). In other words, it rewards including as many relevant items in the search results as possible, which, in our view, is not the main purpose of a search engine designed to aid medical coding. When looking for ICPC-2 codes, showing adequate codes at the top of the results is more relevant than being able to retrieve all possible codes for a given query. Also, showing irrelevant codes at the top hinders user experience, and users rarely look at extensive lists of results or try different queries to find a specific code (6). Considering this, P@k and AP@k are more appropriate to our study objectives, since they give higher scores to relevant results at the top.

Finally, the authors employed frequency-based techniques to avoid ambiguities in expressions mapped to multiple ICD-10 codes. In our study, we chose not to perform any frequency-based prioritization. In ICPC-2, the same expression can be interpreted differently depending on the patient’s context. For example, “chest pain” can be associated with different ICPC-2 codes, such as A11, K01, R01 or L04. A healthcare professional using the search engine might choose K01 if the patient indicates that their pain is heart-related or if the professional assesses it as such. Alternatively, they might select L04 if the pain appears to come from the musculoskeletal system, or R01 if the pain is linked to the respiratory system. Finally, A11 might be chosen if the pain cannot be clearly attributed to the heart, musculoskeletal system, or respiratory system. In this context, prioritizing expression-code mappings based on frequency or other methods would not be meaningful to this research.

Many research groups are interested in automating medical coding, but few focus specifically on retrieval systems aimed at medical coding. It is clear that LLMs cannot be used directly for medical coding, because they show poor performance and risk hallucinating codes or false medical information (29).

One study worked on ICD-10 coding automation using a ‘Retrieve-Rank’ approach (30). First, instead of relying on prompt engineering to generate ICD-10 codes with LLMs, the authors trained a transformer model (ColBERT-V2) with ICD-10-CM data. Second, the model was used to generate embeddings and retrieve relevant ICD-10 codes for 100 queries pre-encoded with ICD-10-CM. Third, an LLM (GPT-3.5-turbo) was used to rerank the retrieved results. This technique had shown state-of-the-art performance in three benchmarks targeting extreme multi-label classification tasks (31), but this study was one of the first attempts to apply it to medical coding. Although it reported 100% accuracy in the model classification task, no evaluation step was applied to the retrieval system, for example, assessing P@k, recall@k, F1-score, or other metrics. Additionally, it is not clear whether the evaluation dataset overlapped with the training data. That would indicate an important study limitation, since the reported accuracy could result from overfitting, limiting generalizability. In contrast, our study compared different information retrieval algorithms that require no additional training. Our evaluation dataset was created with real-world queries generated by users, which were not used to develop the search engine. It is worth noting that most queries we collected (64.6%) were unique expressions, many including typos, synonyms, and terms absent from the thesaurus.

There are several techniques that combine algorithms and leverage neural network pretraining and fine-tuning to achieve better results in information retrieval tasks.

The traditional BM25 algorithm has been extended in several ways to address its limitation of being based on exact string matching, which lacks semantic similarity assessment. The BMX algorithm (11), for example, incorporates several strategies to achieve better performance. Entropy-weighted similarity is used to define word relevance based on each word’s contribution to distinguishing relevant documents, giving more weight to less frequent but informative terms. Weighted query augmentation leverages LLMs to expand queries with synonyms or related terms; the relevance of the generated queries is determined by the LLM’s confidence in generation and by how similar the generated terms are to the original query. Score normalization is used to facilitate threshold definition in information retrieval tasks. By adding these techniques to the original BM25 algorithm, BMX achieved better results in multiple benchmarks and comparable performance to embedding models. This technique has not yet been applied to medical coding tasks.

Training and/or fine-tuning neural networks was outside the scope of this paper. In future research, new pretraining and fine-tuning techniques can be compared with the readily available methods evaluated in this article.

Medical coding automation is a natural next step for research involving medical coding. We believe that strong information retrieval systems will be part of that step, and this study contributes a baseline benchmark for information retrieval algorithms applied to ICPC-2 codes.

Implications and actions needed

For future research, the first task is to expand the labeled queries dataset by labeling more data in a peer-reviewed manner and sharing the search engine with healthcare providers in different parts of the country. The second task is to extend the study to evaluate how these models perform in other languages. Thirdly, new models and new approaches to semantic search are arising and can also be evaluated in the context of medical coding.


Conclusions

Medical coding is an essential process for collecting interpretable data from any healthcare system and can be used as the basis for several data-driven interventions in different contexts, such as research, healthcare services, and entire healthcare systems.

This study investigated three different approaches to help healthcare professionals in medical coding with the ICPC-2, a widely used international classification for primary care settings. The results show that semantic search with LLM embeddings generally has the best performance in retrieving adequate ICPC-2 codes for primary care providers’ queries in Brazilian Portuguese. Compared to BM25 without any text preprocessing, the increase in performance measured with P@1 can reach 96%. Of all embedding models evaluated, text-embedding-3-large from OpenAI had the best performance in this task. This approach seems promising for improving the quality of data collection in EHRs and reducing professionals’ cognitive effort spent on medical coding. The BM25 algorithm combined with query preprocessing is also a promising approach to ICPC-2 code retrieval, since it can reach performance similar to semantic search in some metrics, at lower cost and with faster computation. It is limited, though, by its inability to deal with typos or synonyms.

Future research should include this evaluation with different languages and newer embedding models, an evaluation of how different strategies impact the quality of data collection, and a larger peer-reviewed dataset, preferably with data from different parts of the country so that all the variability in the language can be represented.


Acknowledgments

We thank Aline de Souza Oliveira, MD, for her contribution in the peer review of the evaluation dataset. We thank the reviewers of this manuscript for their insightful suggestions. We thank Gustavo Gusso, MD, PhD, for connecting the authors and being an inspiration for medical coding research in primary care.


Footnote

Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-341/dss

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-341/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-341/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The study was approved by the institutional ethics committee of Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) (No. 70023923.3.0000.0068), and informed consent was not necessary from the users who interacted with the search engine, as no personally identifiable information was collected.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Soler JK, Okkes I, Oskam S, et al. Revisiting the concept of 'chronic disease' from the perspective of the episode of care model. Does the ratio of incidence to prevalence rate help us to define a problem as chronic? Inform Prim Care 2012;20:13-23. [PubMed]
  2. Soler JK, Corrigan D, Kazienko P, et al. Evidence-based rules from family practice to inform family practice; the learning healthcare system case study on urinary tract infections. BMC Fam Pract 2015;16:63. [Crossref] [PubMed]
  3. Gusso GDF. Diagnóstico de demanda em Florianópolis utilizando a Classificação Internacional de Atenção Primária: 2a edição (CIAP-2). Universidade de São Paulo, Agência USP de Gestão da Informação Acadêmica (AGUIA); 2010. doi: 10.11606/T.5.2009.tde-08032010-164025.
  4. Soler JK, Okkes I, Oskam S, et al. An international comparative family medicine study of the Transition Project data from the Netherlands, Malta and Serbia. Is family medicine an international discipline? Comparing incidence and prevalence rates of reasons for encounter and diagnostic titles of episodes of care across populations. Fam Pract 2012;29:283-98. [Crossref] [PubMed]
  5. Wockenfuss R, Frese T, Herrmann K, et al. Three- and four-digit ICD-10 is not a reliable classification system in primary care. Scand J Prim Health Care 2009;27:131-6. [Crossref] [PubMed]
  6. Horsky J, Drucker EA, Ramelson HZ. Accuracy and Completeness of Clinical Coding Using ICD-10 for Ambulatory Visits. AMIA Annu Symp Proc 2017;2017:912-20. [PubMed]
  7. WONCA WICC. ICPC-2-R, International Classification of Primary Care, Revised Second Edition [Internet]. 2nd ed. Oxford University Press; 2005. Available online: http://wicc.news/wp-content/uploads/2019/01/Wonca-ICPC-2-R_OUP2005-complete2.pdf
  8. Olagundoye OA, Malan Z, Mash B, et al. Reliability measurement and ICD-10 validation of ICPC-2 for coding/classification of diagnoses/health problems in an African primary care setting. Fam Pract 2018;35:406-11. [Crossref] [PubMed]
  9. Robertson SE, Walker S, Jones S, et al. Okapi at TREC-3. In: Text Retrieval Conference [Internet]. 1994. Available online: https://api.semanticscholar.org/CorpusID:3946054
  10. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady 1966;10:707-10.
  11. Li X, Lipp J, Shakir A. BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search [Internet]. 2024. Available online: https://arxiv.org/abs/2408.06643
  12. Berger B, Waterman MS, Yu YW. Levenshtein Distance, Sequence Comparison and Biological Database Search. IEEE Trans Inf Theory 2021;67:3287-94. [Crossref] [PubMed]
  13. Neelakantan A, Xu T, Puri R, et al. Text and Code Embeddings by Contrastive Pre-Training [Internet]. 2022. Available online: https://arxiv.org/abs/2201.10005
  14. Becker HW, Oskam SK, Okkes IM, et al. ICPC2-ICD10 Thesaurus: A diagnostic terminology for semi-automatic double coding in Electronic Patient Records. Amsterdam: Academic Medical Center/University of Amsterdam, Department of Family Medicine; 2005.
  15. WONCA WICC, de Medicina de Família e Comunidade SBMFC. Classificação Internacional de Atenção Primária (CIAP 2) [Internet]. Sociedade Brasileira de Medicina de Família e Comunidade; 2010. Available online: http://www.sbmfc.org.br/wp-content/uploads/media/file/ciap/tesauro.xls
  16. Trotman A, Keeler D. Ad hoc IR: not much room for improvement. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. (SIGIR ’11). ACM; 2011:1095-6.
  17. Zhao WX, Liu J, Ren R, et al. Dense Text Retrieval based on Pretrained Language Models: A Survey. ACM Trans Inf Syst 2023;42:1-60.
  18. Malkov YA, Yashunin DA. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans Pattern Anal Mach Intell 2020;42:824-36. [Crossref] [PubMed]
  19. Wikipedia. Evaluation measures (information retrieval) [Internet]. 2024. Available online: https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
  20. Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Meas 1960;20:37-46. [Crossref]
  21. Rosenberg A, Binkowski E. Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In: Proceedings of HLT-NAACL 2004: Short Papers [Internet]. Boston, Massachusetts, USA: Association for Computational Linguistics; 2004:77-80. Available online: https://aclanthology.org/N04-4020
  22. Wikipedia contributors. Jaccard index — Wikipedia, The Free Encyclopedia [Internet]. 2024. Available online: https://en.wikipedia.org/w/index.php?title=Jaccard_index&oldid=1262289626
  23. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276-82. [Crossref] [PubMed]
  24. Wikipedia contributors. One-way analysis of variance — Wikipedia, The Free Encyclopedia [Internet]. 2024. Available online: https://en.wikipedia.org/w/index.php?title=One-way_analysis_of_variance&oldid=1207287507
  25. Tukey JW. Comparing individual means in the analysis of variance. Biometrics 1949;5:99-114. [Crossref] [PubMed]
  26. Wikipedia contributors. Chi-squared test — Wikipedia, The Free Encyclopedia [Internet]. 2024. Available online: https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=1214715020
  27. Park H, Castaño J, Ávila P, et al. An Information Retrieval Approach to ICD-10 Classification. Stud Health Technol Inform 2019;264:1564-5. [Crossref] [PubMed]
  28. Wikipedia contributors. Precision and recall — Wikipedia, The Free Encyclopedia. 2024. Available online: https://en.wikipedia.org/wiki/Precision_and_recall
  29. Soroush A, Glicksberg BS, Zimlichman E, et al. Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying. NEJM AI 2024. doi: 10.1056/AIdbp2300040.
  30. Kwan K. Large language models are good medical coders, if provided with tools [Internet]. 2024. Available online: https://arxiv.org/abs/2407.12849
  31. D’Oosterlinck K, Khattab O, Remy F, et al. In-Context Learning for Extreme Multi-Label Classification [Internet]. 2024. Available online: https://arxiv.org/abs/2401.12178
doi: 10.21037/jmai-24-341
Cite this article as: de Almeida VA, van der Haring EJ, van Boven K, Lopez LF. International Classification of Primary Care (ICPC-2) and search engines: an exploration of three algorithms for information retrieval to aid medical coding. J Med Artif Intell 2025;8:40.
