The 2019 track focuses on an important use case in clinical decision support: providing useful precision medicine-related information to clinicians treating cancer patients. It will largely repeat the structure and evaluation of the 2017 and 2018 tracks.
As with the 2017 and 2018 tracks, we will be using synthetic cases created by precision oncologists at the University of Texas MD Anderson Cancer Center. Each case will describe the patient's disease (type of cancer), the relevant genetic variants (which genes), and basic demographic information (age, sex). The cases are semi-structured and require minimal natural language processing.
Participants in the track will be challenged with retrieving (1) biomedical articles, in the form of article abstracts (largely from MEDLINE/PubMed), addressing relevant treatments for the given patient, and (2) clinical trials (from ClinicalTrials.gov) for which the patient is eligible. The first set of results represents the retrieval of existing knowledge in the scientific literature, while the second represents the potential for connecting patients with experimental treatments if existing treatments have been ineffective.
New this year is the optional task of specifying the particular treatment recommendation for each retrieved literature article. This will allow search engines to organize literature articles by individual treatment, and it will allow the breadth of treatments a system returns to be evaluated. Previously, there was no means of distinguishing a system that returned many articles related to the same treatment from a system that returned articles related to many different treatments.
| Date | Note |
|---|---|
| May 2019 | Document collection available for download. |
| May 2019 | Applications for participation in TREC 2019 due. |
| Early June 2019 | Topics available for download. |
| August 4, 2019 | Submission deadline. |
| October 2019 | Relevance judgments and individual evaluation scores released. |
| November 13–15, 2019 | TREC 2019 conference at NIST in Gaithersburg, MD, USA. |
There are two target document collections for the Precision Medicine track: scientific abstracts and clinical trials. Both XML and TXT versions are available for both sets. Note that the XML is the official collection, as it has the complete information for each abstract/trial. The TXT versions are provided for ease of processing, but no guarantees are made that all information is contained within these files.
Scientific Abstracts: The MEDLINE 2019 baseline will be used for the scientific abstracts. The 2019 baseline is a snapshot (roughly mid-December 2018) of PubMed abstracts.
Clinical Trials: A May 2019 snapshot of ClinicalTrials.gov will be used for the clinical trial descriptions.
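For participants working from the XML versions, the sketch below shows one way to pull the core fields out of each collection. The element paths follow the standard MEDLINE baseline format and the legacy ClinicalTrials.gov export format and should be verified against the distributed files; the file names in the usage comments are illustrative.

```python
# Minimal parsing sketch for the two XML collections. Element paths follow
# the standard MEDLINE baseline and legacy ClinicalTrials.gov export formats;
# verify them against the files distributed for the track.
import gzip
import xml.etree.ElementTree as ET

def iter_medline_abstracts(path):
    """Stream (pmid, title, abstract) tuples from one gzipped baseline file."""
    with gzip.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "PubmedArticle":
                pmid = elem.findtext("MedlineCitation/PMID")
                title = elem.findtext("MedlineCitation/Article/ArticleTitle")
                # Abstracts may be split into multiple labeled sections.
                abstract = " ".join(
                    t.text or ""
                    for t in elem.findall("MedlineCitation/Article/Abstract/AbstractText")
                )
                yield pmid, title, abstract
                elem.clear()  # keep memory bounded while streaming

def parse_trial(path):
    """Extract the fields most useful for eligibility matching from one trial record."""
    root = ET.parse(path).getroot()
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "condition": root.findtext("condition"),
        "criteria": root.findtext("eligibility/criteria/textblock", default=""),
        "gender": root.findtext("eligibility/gender"),
        "minimum_age": root.findtext("eligibility/minimum_age"),
        "maximum_age": root.findtext("eligibility/maximum_age"),
    }

# Example usage (file names are illustrative):
# for pmid, title, abstract in iter_medline_abstracts("pubmed19n0001.xml.gz"): ...
# print(parse_trial("NCT01155453.xml"))
```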
The topics for the track consist of synthetic patient cases created by MD Anderson precision oncologists. Each topic specifies the disease, genetic variants, and demographic information for the patient. For example:
| | Patient 1 | Patient 2 |
|---|---|---|
| Disease | Acute lymphoblastic leukemia | thyroid cancer |
| Variant | ABL1, PTPN11 | RET, BRAF |
| Demographic | 12-year-old male | 63-year-old female |
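Once the topics file is released, it can be read with a few lines of XML parsing. The sketch below assumes per-topic elements named disease, gene, and demographic (matching the fields above); the element names and the topics file name should be checked against the actual release.

```python
# Minimal sketch for reading the topics file. The element names (topic,
# disease, gene, demographic) and the "number" attribute are assumptions
# based on the fields described above; confirm them against the released XML.
import xml.etree.ElementTree as ET

def load_topics(path):
    topics = []
    for topic in ET.parse(path).getroot().iter("topic"):
        topics.append({
            "number": topic.get("number"),
            "disease": (topic.findtext("disease") or "").strip(),
            "gene": (topic.findtext("gene") or "").strip(),
            "demographic": (topic.findtext("demographic") or "").strip(),
        })
    return topics

for t in load_topics("topics2019.xml"):  # file name is illustrative
    print(t["number"], t["disease"], t["gene"], t["demographic"])
```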
Additionally, the 2017 and 2018 topics might be useful.
The evaluation will follow standard TREC evaluation procedures for ad hoc retrieval tasks. Participants may submit a maximum of five automatic or manual runs for each corpus (scientific abstracts and clinical trials), each consisting of a ranked list of up to one thousand IDs (PMIDs for MEDLINE abstracts and NCT IDs for trials), plus, optionally, the specific treatments for literature abstracts. The highest-ranked results for each topic will be pooled and judged by physicians trained in medical informatics.
Assessors will be instructed to judge abstracts and clinical trials according to each of the three topic dimensions (disease, gene, demographic). Each of these corresponds to 3-4 categories (e.g., a disease can be an "exact", "more general", "more specific", or "not disease" match). Please read the Relevance Guidelines for more details.
Scientific Abstracts: The goal of retrieving scientific abstracts is to identify relevant articles for the treatment, prevention, and prognosis of the disease under the specific conditions for the given patient. Abstracts discussing information not useful for these goals will not be considered relevant. To aid with the treatment extraction task that is new in 2019, we provide a run of MetaMapLite over the MEDLINE corpus.
The files contain the PMID, UMLS CUI, and concept text for the concepts found by MetaMapLite. Note that we do not consider this list exhaustive or canonical. The human assessors will judge treatments without respect to what is found in these files. We provide them only as an aid to participants who do not wish to perform treatment extraction themselves; a small reading sketch appears after the clinical trials note below.

Clinical Trials: The goal of retrieving clinical trials is to identify trials for which the given patient is eligible to enroll, or would have been eligible to enroll had the trial been open. The timing and location of the trial are not factors in determining relevance, only the eligibility criteria.
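Returning to the MetaMapLite concept files: the sketch below reads them into a PMID-to-concepts lookup. The tab-delimited, three-column layout assumed here is a guess based on the description above; check the distributed files for the actual delimiter and column order.

```python
# Minimal sketch for loading the MetaMapLite concept files into a
# PMID -> [(CUI, concept text), ...] lookup. The tab-delimited, three-column
# layout (PMID, UMLS CUI, concept text) is an assumption; verify it against
# the distributed files before relying on it.
from collections import defaultdict

def load_concepts(path, delimiter="\t"):
    concepts = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split(delimiter)
            if len(parts) >= 3:
                pmid, cui, text = parts[0], parts[1], parts[2]
                concepts[pmid].append((cui, text))
    return concepts

# Example usage (file name is illustrative):
# concepts = load_concepts("metamaplite_concepts.txt")
```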
As in past evaluations of medically-oriented TREC tracks, we are fortunate to have the assessment conducted by the Department of Medical Informatics of the Oregon Health and Science University (OHSU). We are extremely grateful for their participation.
The submission deadline will be August 4, 2019.
The format for run submissions mirrors the standard trec_eval format. Each line of the submission file should follow the form:
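    TOPIC_NO 0 ID RANK SCORE RUN_NAME ["TREATMENT_1"] ["TREATMENT_2"] ["TREATMENT_3"]

(This template is reconstructed from the field descriptions below; the bracketed treatment fields are optional and apply only to literature abstract runs.)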
where TOPIC_NO is the topic number (1–30), 0 is a required but ignored constant, ID is the identifier of the retrieved document (PMID or NCT ID), RANK is the rank (1–1000) of the retrieved document, SCORE is a floating point value representing the similarity score of the document, and RUN_NAME is an identifier for the run. The RUN_NAME is limited to 12 alphanumeric characters (no punctuation). New in 2019: up to three treatments per abstract are allowed, each enclosed in quotes. Treatments are optional, so a line may list none, but no more than three should be provided, and they should be provided only for literature abstracts, not clinical trials.
The file is assumed to be sorted numerically by TOPIC_NO, and SCORE is assumed to be greater for documents that should be retrieved first. For example, the following would be a valid line of a run submission file:
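    1 0 28348404 1 0.9999 my-run "TREATMENT_TEXT_1" "TREATMENT_TEXT_2"

(The quoted treatment strings here are placeholders; in an actual submission each would be the exact treatment text as it appears in the abstract.)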
The above line indicates that the run named "my-run" retrieves for topic number 1 document 28348404 at rank 1 with a score of 0.9999. Note that each treatment should be a whitespace-insensitive exact match with the text found in the abstract (so, e.g., newlines should be replaced by a space character).
Clinical Trials: For clinical trials, the following would likewise be a valid line (note that no treatments are provided):

    1 0 NCT01155453 1 0.9999 my-run

The above line indicates that the run named "my-run" retrieves for topic number 1 trial NCT01155453 at rank 1 with a score of 0.9999.
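Before submitting, a quick self-check against the constraints above (topic and rank ranges, the run name length, the treatment limits, and per-topic result counts) can catch most formatting problems. The sketch below is an informal aid, not an official validator; the function name and the --trials flag are illustrative.

```python
# Informal run-file checker for the constraints described above: topics 1-30
# sorted numerically, the constant 0 column, rank 1-1000, a numeric score,
# a RUN_NAME of at most 12 characters, at most three quoted treatments
# (literature runs only), and at most 1000 results per topic.
import shlex
import sys
from collections import Counter

def check_run(path, clinical_trials=False):
    per_topic = Counter()
    last_topic = 0
    with open(path) as f:
        for lineno, raw in enumerate(f, start=1):
            fields = shlex.split(raw)  # also strips the quotes off treatments
            if len(fields) < 6:
                print(f"line {lineno}: expected at least 6 fields")
                continue
            topic_no, constant, doc_id, rank, score, run_name = fields[:6]
            treatments = fields[6:]
            if not topic_no.isdigit() or not 1 <= int(topic_no) <= 30:
                print(f"line {lineno}: TOPIC_NO should be an integer in 1-30")
                continue
            topic = int(topic_no)
            if topic < last_topic:
                print(f"line {lineno}: file should be sorted numerically by TOPIC_NO")
            if constant != "0":
                print(f"line {lineno}: second column should be the constant 0")
            if not rank.isdigit() or not 1 <= int(rank) <= 1000:
                print(f"line {lineno}: RANK should be an integer in 1-1000")
            try:
                float(score)
            except ValueError:
                print(f"line {lineno}: SCORE should be a floating point value")
            if len(run_name) > 12:
                print(f"line {lineno}: RUN_NAME should be at most 12 characters")
            if len(treatments) > 3 or (clinical_trials and treatments):
                print(f"line {lineno}: at most 3 treatments, literature runs only")
            per_topic[topic] += 1
            last_topic = topic
    for topic, count in sorted(per_topic.items()):
        if count > 1000:
            print(f"topic {topic}: {count} results exceed the 1000-result limit")

if __name__ == "__main__":
    check_run(sys.argv[1], clinical_trials="--trials" in sys.argv)
```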