The vast majority of clinical trials fail to meet their patient recruitment goals. NIH has estimated that 80% of clinical trials miss their recruitment timeline and, more critically, many (or most) never recruit the minimum number of patients needed to power the study as originally designed. Inefficient patient recruitment is thus one of the major barriers to medical research, delaying some trials and forcing others to terminate entirely.
Historically, clinical trial recruitment was driven by trial coordinators (e.g., through direct contact with clinical specialists or by searching the electronic health record for eligible patients), but it has recently become increasingly common for patients to search for trials themselves (often in consultation with their clinician). The 2023 TREC Clinical Trials track simulates this scenario. Instead of using synthetic patient cases as in the 2021 and 2022 tracks, the 2023 track uses a simulated "questionnaire" that the patient or their clinician would fill out in order to identify eligible clinical trials. The track will use several high-level disorder questionnaire templates (e.g., glaucoma, COPD, anxiety), with each template having 5-12 fields customized to that disorder (e.g., the type 2 diabetes template has fields for HbA1c, glucose, BMI, insulin, etc.). For each template, there will be several topics representing synthetic patients with that condition but different values for the fields.
Participants in the track will be challenged to retrieve clinical trials from ClinicalTrials.gov, a required registry for clinical trials in the United States. Clinical trial descriptions can be quite long, but the core aspects of a trial description are its inclusion/exclusion criteria. These criteria are not so all-inclusive that the rest of the trial description can be ignored, but they are central to defining trial eligibility. The evaluation will further be broken down into eligible, excludes, and not relevant so that retrieval methods can distinguish between patients who do not have sufficient information to qualify for a trial (not relevant) and those who are explicitly excluded (excludes).
Date | Note |
---|---|
10 May 2023 | Document collection available for download (not the same as the 2021-2022 collection) |
11 May 2023 | Draft Topic Template available |
31 May 2023 | Applications for participation in TREC 2023 due (contact organizers thereafter) |
mid June 2023 | Topics available for download along with final Topic Templates |
28 August 2023 | Submission deadline |
October 2023 | Relevance judgments and individual evaluation scores released |
November 15–17, 2023 | TREC 2023 conference at NIST in Gaithersburg, MD, USA (may be held virtually) |
Clinical Trials: A May 8, 2023 snapshot of ClinicalTrials.gov will be used as the corpus. You can download those files below, grouped into batch files by trial ID.
The files are formatted using the ClinicalTrials.gov XML schema.
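For participants new to the collection, the sketch below shows one way to pull the eligibility-related fields out of a single trial record using only Python's standard library. The element names (id_info/nct_id, brief_title, eligibility/criteria/textblock, etc.) are assumed from the legacy ClinicalTrials.gov XML schema and should be verified against the downloaded files; the file name in the usage comment is only illustrative.

```python
# Minimal sketch of extracting the fields most relevant to eligibility matching
# from one trial record. Element names follow the legacy ClinicalTrials.gov XML
# schema (verify against the downloaded files).
import xml.etree.ElementTree as ET

def parse_trial(path: str) -> dict:
    root = ET.parse(path).getroot()

    def text(xpath: str) -> str:
        node = root.find(xpath)
        return node.text.strip() if node is not None and node.text else ""

    return {
        "nct_id": text("id_info/nct_id"),
        "title": text("brief_title"),
        "summary": text("brief_summary/textblock"),
        "criteria": text("eligibility/criteria/textblock"),
        "gender": text("eligibility/gender"),
        "minimum_age": text("eligibility/minimum_age"),
        "maximum_age": text("eligibility/maximum_age"),
    }

# Example usage (file name is illustrative):
# trial = parse_trial("NCT00760162.xml")
# print(trial["nct_id"], trial["criteria"][:200])
```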
The topics for the track consist of synthetic patient descriptions based on questionnaire templates. We currently have draft topic templates for eight different disorders. In the actual topics, none of the fields is required (i.e., any field may be left blank) and there is no guaranteed format for the provided responses (i.e., each field is natural language, not structured data). In essence, this simulates a patient or clinician filling out the questionnaire, leaving out any information that they do not have available. Obviously, the more information provided about a patient and the more consistent its format, the better the patient can be matched to trials. In a clinical environment, however, it is reasonable and expected that not all possible information will be available, nor is it always worth the time and cost to acquire. See the example topics below for how a template is instantiated as a patient-specific topic.
< Glaucoma Template > | Patient1 | Patient2 |
---|---|---|
diagnosis | POAG | uveitic glaucoma |
intraocular pressure | 19 mmHg | 22 mmHg |
visual field | | advanced damage |
visual acuity | 20/80 | 20/200 |
prior cataract surgery | no | no |
prior LASIK surgery | no | no |
comorbid ocular diseases | | uveitis |
< COVID-19 Template > | Patient3 | Patient4 |
---|---|---|
diagnosis | PCR-confirmed | never |
symptoms | fever, cough, headache, fatigue | |
hospitalization | yes | no |
ventilation | no | no |
vaccination status | unvaccinated | fully vaccinated |
oxygen saturation | 92% | |
comorbid respiratory diseases | | asthma |
In the topics file, each topic corresponds to one patient, with one field element per template field (fields without information are left empty), for example:

    <topics task="2023 TREC Clinical Trials">
      <topic number="-1" template="glaucoma">
        <field name="diagnosis">POAG</field>
        <field name="intraocular pressure">19 mmHg</field>
        <field name="visual field"></field>
        <field name="visual acuity">20/80</field>
        <field name="prior cataract surgery">no</field>
        <field name="prior LASIK surgery">no</field>
        <field name="comorbid ocular diseases"></field>
      </topic>
    </topics>
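As a rough illustration (not an official tool), a topics file in this shape can be read into one dictionary per patient with a few lines of Python; the element and attribute names follow the example above, and the file name in the usage comment is a placeholder.

```python
# Sketch of loading the topics XML into per-patient dictionaries keyed by
# field name. Empty fields are kept as empty strings, mirroring the optional
# questionnaire fields described above.
import xml.etree.ElementTree as ET

def load_topics(path: str) -> list[dict]:
    topics = []
    for topic in ET.parse(path).getroot().iter("topic"):
        fields = {f.get("name"): (f.text or "").strip() for f in topic.iter("field")}
        topics.append({
            "number": topic.get("number"),
            "template": topic.get("template"),
            "fields": fields,
        })
    return topics

# Example usage (file name is illustrative):
# for t in load_topics("topics2023.xml"):
#     query = " ".join(f"{k}: {v}" for k, v in t["fields"].items() if v)
#     print(t["number"], t["template"], query)
```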
The 2021 and 2022 topics had a very different structure (free-text narratives along the lines of a case report or electronic health record note), but they may nonetheless be useful to participants.
The evaluation will follow standard TREC evaluation procedures for ad hoc retrieval tasks. Participants may submit a maximum of five automatic or manual runs, each consisting of a ranked list of up to one thousand IDs (NCT IDs provided by ClinicalTrials.gov). The highest ranked results for each topic will be pooled and judged by physicians trained in medical informatics. Assessors will be instructed to judge trials as either eligible (patient meets inclusion criteria and exclusion criteria do not apply), excluded (patient meets inclusion criteria, but is excluded on the grounds of the trial's exclusion criteria), or not relevant. Because we plan to use a graded relevance scale, the performance of the retrieval submissions will be measured using normalized discounted cumulative gain (NDCG).
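For intuition about the metric, the sketch below computes NDCG@k for a single topic. The grade mapping (eligible = 2, excluded = 1, not relevant = 0) and the example judgments are assumptions for illustration only; official scores will be computed from the released relevance judgments with standard TREC tooling.

```python
# Illustrative NDCG@k for one topic. The grade mapping (eligible=2, excluded=1,
# not relevant=0) and the sample judgments below are assumptions for
# illustration, not the official scoring setup.
import math

def ndcg_at_k(ranked_ids: list[str], grades: dict[str, int], k: int = 10) -> float:
    dcg = sum(
        (2 ** grades.get(doc_id, 0) - 1) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one topic with assumed grades (NCT IDs are placeholders).
grades = {"NCT00760162": 2, "NCT01234567": 1}   # eligible, excluded
run = ["NCT00760162", "NCT09999999", "NCT01234567"]
print(round(ndcg_at_k(run, grades, k=10), 4))
```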
The tentative submission deadline is 28 August 2023 (follow the mailing list for updates).
The format for run submissions follows the standard trec_eval format. Each line of the submission file should follow the form:

    TOPIC_NO Q0 ID RANK SCORE RUN_NAME

where TOPIC_NO is the topic number (1–30), Q0 is a required but ignored constant, ID is the identifier of the retrieved document (an NCT ID from ClinicalTrials.gov), RANK is the rank (1–1000) of the retrieved document, SCORE is a floating point value representing the confidence score of the document, and RUN_NAME is an identifier for the run. The RUN_NAME is limited to 12 alphanumeric characters (no punctuation).
The file is assumed to be sorted numerically by TOPIC_NO, and SCORE is assumed to be greater for documents that should be retrieved first. For example, the following would be a valid line of a run submission file:

    1 Q0 NCT00760162 1 0.9999 my-run
The above line indicates that the run named "my-run" retrieves document NCT00760162 for topic number 1 at rank 1 with a score of 0.9999.
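Putting the format together, the following sketch writes a ranked result list in the six-column form described above; the result data, run name, and file name are all placeholders for whatever retrieval method a participant uses.

```python
# Sketch of writing a run file in the trec_eval format described above.
# `results` maps each topic number to (nct_id, score) pairs produced by some
# retrieval method; names and values here are placeholders.
def write_run(results: dict[int, list[tuple[str, float]]], run_name: str, path: str) -> None:
    with open(path, "w") as out:
        for topic_no in sorted(results):                      # sorted numerically by TOPIC_NO
            ranked = sorted(results[topic_no], key=lambda x: x[1], reverse=True)[:1000]
            for rank, (nct_id, score) in enumerate(ranked, start=1):
                out.write(f"{topic_no} Q0 {nct_id} {rank} {score:.4f} {run_name}\n")

# Example usage with made-up scores:
# write_run({1: [("NCT00760162", 0.9999), ("NCT01234567", 0.42)]}, "myrun", "myrun.txt")
```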