Ingemar J. Cox’s research while affiliated with IT University of Copenhagen and other places

Publications (245)


EV406/#844 Using online search activity for earlier detection of gynaecological malignancy
  • Conference Paper

November 2024 · 3 Reads

International Journal of Gynecological Cancer

Jennifer Barcroft · Elad Yom-Tov · [...]

Introduction: Ovarian cancer is the most lethal gynaecological cancer in the UK, yet there is no screening program in place to facilitate early disease detection. The aim is to evaluate whether online search data (OSD) can be used to detect individuals with gynaecological malignancy.
Methods: This prospective cohort study evaluates OSD in individuals referred with a suspected cancer to a London hospital (UK) between December 2020 and June 2022. OSD was extracted via Google Takeout and anonymised. Health-related terms were extracted for the 24 months prior to GP referral. A predictive model was developed using (1) search terms and (2) categorised search queries. Area under the ROC curve (AUC) was used to evaluate model performance. Of 844 women approached, 652 were eligible to participate, 392 were recruited and 235 completed enrolment.
Results: The cohort had a median age of 53 years (range 20-81) and a 26.0% malignancy rate. OSD differed between individuals with a benign and a malignant diagnosis as early as 360 days before GP referral when all search terms were used, but only 60 days before when categorised search queries were used. A model using OSD from individuals (n=153) who performed health-related searches achieved its highest (sample-corrected) AUC of 0.82, 60 days before GP referral.
Conclusion/Implications: OSD appears to differ between individuals with malignant and benign gynaecological conditions, with a signal observed in advance of the GP referral date. OSD needs to be evaluated in a larger dataset to determine its value as an early disease detection tool and whether its use leads to improved clinical outcomes.


Estimating the household secondary attack rate and serial interval of COVID-19 using social media
  • Article
  • Full-text available

July 2024 · 21 Reads

npj Digital Medicine

We propose a method to estimate the household secondary attack rate (hSAR) of COVID-19 in the United Kingdom based on activity on the social media platform X, formerly known as Twitter. Conventional methods of hSAR estimation are resource intensive, requiring regular contact tracing of COVID-19 cases. Our proposed framework provides a complementary method that does not rely on conventional contact tracing or laboratory involvement, including the collection, processing, and analysis of biological samples. We use a text classifier to identify reports of people tweeting about themselves and/or members of their household having COVID-19 infections. A probabilistic analysis is then performed to estimate the hSAR based on the number of tweets reporting COVID-19 infection in the user or their household, and in both the user and their household. The analysis includes adjustments for a reluctance of Twitter users to tweet about household members, and for the possibility that the secondary infection was not acquired within the household. Experimental results for the UK, both monthly and weekly, are reported for the period from January 2020 to February 2022. Our results agree with previously reported hSAR estimates, varying with the primary variants of concern, e.g. delta and omicron. The serial interval (SI) is estimated from the time between the two tweets that indicate a primary and a secondary infection. The resulting SI estimates, though larger than the consensus, are qualitatively similar. The estimation of hSAR and SI using social media data constitutes a new tool that may help in characterizing, forecasting and managing outbreaks and pandemics in a faster, more affordable, and more efficient manner.


of patient enrolment flowchart. The flowchart outlines the enrolment process for the study cohort (n = 235) and health-related search cohort (n = 153), from individuals referred to a London University Teaching Hospital with a suspected cancer between December 2020-June 2022. It outlines the reasons for incomplete enrolment and exclusion from the study
The time series chart outlining the number of online search queries per discrete category within the study cohort. The time series chart outlines the number of online search queries made per patient, within each distinct symptom category: menopause, urinary, bleeding, bloating, gastrointestinal, vagina, pain etc. stratified by outcome (benign/malignant) up to 490 days in advance of GP referral. The time series are smoothed using a 4-week moving average. Online search activity can identify symptomatic individuals with gynaecological cancer at an earlier stage
Model performance as a function of start and end times. The top figure shows the AUC for the terms model and the bottom figure for the categories model. The start and end times correspond to the duration of time in advance of GP referral date. Different lines correspond to different start (T1) times and the dots on each line correspond to different end (T2) times. Each dot represents the average of 10 runs. Standard deviation is equal, on average, to 0.01 (1.8% of the average AUC)
A histogram (10 bins) of model classification scores, when applied to our sample population (n = 235) and Bing users (n = 1.8 million). The histogram demonstrates the classification score for individual users. A high classification score indicates an increased likelihood of a malignant diagnosis. The Bing user population is distributed towards lower classification scores, in line with benign sample population and a lower likelihood of malignancy
Using online search activity for earlier detection of gynaecological malignancy

March 2024 · 63 Reads · 3 Citations

BMC Public Health

Background: Ovarian cancer is the most lethal and endometrial cancer the most common gynaecological cancer in the UK, yet neither has a screening program in place to facilitate early disease detection. The aim is to evaluate whether online search data can be used to differentiate between individuals with malignant and benign gynaecological diagnoses.
Methods: This is a prospective cohort study evaluating online search data in symptomatic individuals (Google users) referred from primary care (GP) with a suspected cancer to a London hospital (UK) between December 2020 and June 2022. Informed written consent was obtained, and online search data was extracted via Google Takeout and anonymised. A health filter was applied to extract health-related terms for the 24 months prior to GP referral. A predictive model (outcome: malignancy) was developed using (1) search queries (terms model) and (2) categorised search queries (categories model). Area under the ROC curve (AUC) was used to evaluate model performance. Of 844 women approached, 652 were eligible to participate and 392 were recruited. Of those recruited, 108 did not complete enrolment, 12 withdrew and 37 were excluded because they did not track Google searches or had an empty search history, leaving a cohort of 235.
Results: The cohort had a median age of 53 years (range 20–81) and a malignancy rate of 26.0%. Online search data differed between those with a benign and a malignant diagnosis as early as 360 days in advance of GP referral when search queries were used directly, but only 60 days in advance when queries were divided into health categories. A model using online search data from patients (n = 153) who performed health-related searches achieved its highest sample-corrected AUC of 0.82, 60 days prior to GP referral.
Conclusions: Online search data appears to differ between individuals with malignant and benign gynaecological conditions, with a signal observed in advance of the GP referral date. Online search data needs to be evaluated in a larger dataset to determine its value as an early disease detection tool and whether its use leads to improved clinical outcomes.
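
For readers unfamiliar with the AUC metric reported in the abstract above, the following is a minimal sketch of how the area under the ROC curve can be computed from classifier scores. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are invented for illustration, and the sample-size correction mentioned in the abstract is not reproduced here. The sketch relies on the standard interpretation of AUC as the probability that a randomly chosen positive (malignant) case receives a higher score than a randomly chosen negative (benign) case.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = malignant, 0 = benign (not taken from the study).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.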



Neural network models for influenza forecasting with associated uncertainty using Web search activity trends

August 2023 · 80 Reads · 4 Citations

Influenza affects millions of people every year. It causes a considerable number of medical visits and hospitalisations as well as hundreds of thousands of deaths. Forecasting influenza prevalence with good accuracy can significantly help public health agencies to react in a timely manner to seasonal or novel strain epidemics. Although significant progress has been made, influenza forecasting remains a challenging modelling task. In this paper, we propose a methodological framework that improves over the state-of-the-art forecasting accuracy of influenza-like illness (ILI) rates in the United States. We achieve this by using Web search activity time series in conjunction with historical ILI rates as observations for training neural network (NN) architectures. The proposed models incorporate Bayesian layers to produce uncertainty intervals for their forecast estimates, positioning themselves as legitimate complementary solutions to more conventional approaches. The best performing NN, referred to as the iterative recurrent neural network (IRNN) architecture, reduces mean absolute error by 10.3% and improves skill by 17.1% on average in nowcasting and forecasting tasks across 4 consecutive flu seasons.
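
As a reading aid for the error figures quoted above, here is a small sketch of how mean absolute error (MAE) and a relative MAE reduction between two forecasters might be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the helper names and all numbers are made up, and the "skill" metric mentioned in the abstract is not defined there, so it is not modelled.

```python
def mean_absolute_error(truth, forecast):
    """Average absolute difference between observed and forecast values."""
    if len(truth) != len(forecast) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - f) for t, f in zip(truth, forecast)) / len(truth)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error


# Hypothetical weekly ILI rates (%) and two sets of forecasts.
truth = [2.0, 3.0, 4.0, 5.0]
baseline_forecast = [2.5, 3.5, 3.0, 5.5]
model_forecast = [2.25, 3.25, 3.5, 5.25]

mae_baseline = mean_absolute_error(truth, baseline_forecast)
mae_model = mean_absolute_error(truth, model_forecast)
reduction = relative_reduction(mae_baseline, mae_model)
print(f"MAE reduced by {100 * reduction:.1f}% relative to the baseline")
```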



Using online search activity for earlier detection of gynaecological malignancy

April 2023 · 76 Reads · 1 Citation

Despite major advances in precision treatments and new biomarkers, only 55% of UK cancers are detected at an early stage. Ovarian cancer (OC) remains the most lethal gynaecological malignancy, with survival, quality of life, and fertility implications, as most women present with advanced stage disease. Endometrial cancer (EC) is the most common gynaecological malignancy, with an incidence that is rising exponentially. Unfortunately, no effective screening program exists for EC or OC to facilitate early detection. Conventional healthcare cannot facilitate real-time monitoring of an individual’s health risks. However, digital footprints (e.g. online search activity), with their temporally dense nature, could enable regular ‘health’ monitoring. The utilisation of online search data to identify those at risk of specific diseases has been suggested in other studies, but these were based on individuals with a proxy diagnosis, inferred through online search patterns, rather than a confirmed clinical diagnosis. We evaluated the use of online search data to detect gynaecological cancer in individuals with a confirmed diagnosis. There was a difference in online search patterns between those with benign and malignant diagnoses as early as 360 days prior to primary care (GP) referral with a suspected cancer. A classification model based on online search data achieved its highest AUC (0.82) using data 60 days before the GP referral, in individuals who routinely performed health-related online search queries. Our experiments indicate that online search data could provide individualised gynaecological cancer risk profiles, which are not reliant on individuals recognising key symptom patterns. Online search data could provide an accessible disease screening tool to facilitate the earlier detection of gynaecological malignancy, complementing currently established approaches.
More generally, we advocate that online search data could provide early insights into numerous conditions, including a range of cancers.


Figure 1: Cumulative participant recruitment by method
Figure 2: Count of survey completions by week since the start of Virus Watch recruitment (June 2020-May 2022) showing recruitment (total number of participants who completed at least one survey) and retention (total number of participants who completed the latest survey for a given week).
Demographics of Virus Watch study participants
Cohort profile: Virus Watch: Understanding community incidence, symptom profiles, and transmission of COVID-19 in relation to population movement and behaviour

February 2023 · 79 Reads · 1 Citation

Key Features: Virus Watch is a national community cohort study of COVID-19 in households in England and Wales, established in June 2020. The study aims to provide evidence on which public health approaches are most effective in reducing transmission, and to investigate community incidence, symptoms, and transmission of COVID-19 in relation to population movement and behaviours. A total of 28,527 households and 58,628 participants (aged 0-98 years, mean age 48) were recruited between June 2020 and July 2022. Data collected include demographics, details on occupation, co-morbidities, medications, and infection-prevention behaviours. Households are followed up weekly with illness surveys capturing symptoms and their severity, activities in the week prior to symptom onset and any COVID-19 test results. Monthly surveys capture household finance, employment, mental health, access to healthcare, vaccination uptake, activities and contacts. Data have been linked to Hospital Episode Statistics (HES), inpatient and critical care episodes, outpatient visits, emergency care contacts, mortality, virology testing and vaccination data held by NHS Digital. Nested within Virus Watch are a serology & PCR cohort study (n=12,877) and a vaccine evaluation study (n=19,555). Study data are deposited in the Office for National Statistics (ONS) Secure Research Service (SRS). Survey data are available under restricted access upon request to ONS SRS.


E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

December 2022 · 206 Reads · 3 Citations

Identifying named entities such as a person, location or organization in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, the creation of which can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
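
To make the F1-score mentioned above concrete, the sketch below computes an entity-level F1 from gold and predicted entity spans. It assumes exact-match scoring, where a prediction counts only if its start, end and type all agree with a gold entity; the abstract does not state which matching scheme was used, so this is an assumption. The spans, the label names and the function `entity_f1` are hypothetical and are not taken from the E-NER corpus.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 under exact matching of (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start_token, end_token, entity_type).
gold = {(0, 2, "ORG"), (5, 6, "PERSON"), (9, 11, "LOCATION"), (14, 15, "ORG")}
predicted = {(0, 2, "ORG"), (5, 6, "PERSON"), (9, 10, "LOCATION")}

f1 = entity_f1(gold, predicted)
print(f"F1 = {f1:.3f}")
```

In this toy case the third prediction has the right type but the wrong boundary, so under exact matching it counts as both a false positive and a missed gold entity.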


Citations (68)


... At Owerri in Eastern Nigeria, ovarian cancer (33.5%) was the most common, [27] and in the UK, it was endometrial cancer. [28] A Study in USA reported endometrial cancer as the most common, with a high incidence of 40,000 new cases every year, [29] and in Rawalpindi in Pakistan, endometrial cancer was highest, with a rate of 62.86%. [30] It is not acceptable that in this study, the trend in all the genital tract malignancies remain significantly unchanged over the past 7 years. ...

Reference:

The Trend In Gynaecological Admission Diagnosis And Surgeries,
Using online search activity for earlier detection of gynaecological malignancy

BMC Public Health

... ; https://doi.org/10.1101/2024.05.03.24305695 doi: medRxiv preprint Furthermore, Chen et al. noted that despite the limited datasets in their study, there appears to be a tendency toward heightened online search activity before patients with malignant cases visit a general practitioner. 36 ...

32nd World Congress on Ultrasound in Obstetrics and Gynecology Oral communication abstracts Identification of pathological types of adnexal masses from ultrasound images using deep learning models

... Participants (n=2,010) were a sub-cohort of Virus Watch (n=58,628), a household longitudinal cohort study of SARS-CoV-2 infections in England and Wales running since June 2020. Recruitment and methodology of the full cohort have been described in detail elsewhere (16,17). ...

Cohort Profile: Virus Watch-understanding community incidence, symptom profiles and transmission of COVID-19 in relation to population movement and behaviour

International Journal of Epidemiology

... The research paper "Using Online Search Activity for Earlier Detection of Gynaecological Malignancy" focuses on leveraging Google search data to predict gynecological cancers, particularly ovarian cancer. 34 This study built upon previous research conducted by Soldaini and Yom-Tov, which relied on self-identification in queries for outcomes. 35 However, it is important to note that the present investigation employs clinically verified outcomes, thereby enhancing the robustness and reliability of the findings. ...

Using online search activity for earlier detection of gynaecological malignancy

... Virus Watch is a prospective household community cohort study in England and Wales that recruited participants from June 2020 to March 2022. The full protocol for the study has been described previously 14 . Briefly, the study recruited participants through a campaign using general post, leaflets, social media, letters and SMS from General Practices all directing potential participants to the study website for information and consent. ...

Cohort profile: Virus Watch: Understanding community incidence, symptom profiles, and transmission of COVID-19 in relation to population movement and behaviour

... Pioneering works on Legal NER focus on named entity recognition and resolution on US case law (Dozier et al., 2010) and on the creation of a German NER dataset with fine-grained semantic classes (Leitner et al., 2020). Since deep learning approaches require large-scale annotated datasets and general-purpose NER models are trained on a different set of entities, publicly available legal NER data sets have recently been made available (e.g., (Au et al., 2022)). Inspired by the recent advances in NER tasks with span representation (Ouchi et al., 2020), in this paper we explore the use of entity-aware attention mechanism (Yamada et al., 2020) to accomplish the L-NER task. ...

E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

... As COVID-19 spread globally, international travel restrictions and public health measures were implemented following the WHO pandemic declaration [59]. The number of studies using digital signals grew quickly, with an English study using Bing search to detect early warning of COVID-19 [60], and X data to explore symptom keywords for early detection of COVID-19 in Europe and globally [61,62]. During this time, there was an increasing use of multisource digital data, combining search trends and social media data. ...

Providing early indication of regional anomalies in COVID-19 case counts in England using search engine queries

... The main reasons related to reluctance to receive the HZ vaccine include a low perception of disease risk, low confidence in vaccine efficacy and safety, and lack of knowledge about the vaccine availability [17,21]. Studies have shown that people's attitudes, beliefs and emotions about a disease and its vaccine can influence their intention to vaccinate, as occurred in the case of COVID-19 [22][23][24][25][26]; this occurs, in particular, in the case of patients with chronic conditions. ...

Trends, patterns and psychological influences on COVID-19 vaccination intention: Findings from a large prospective community cohort study in England and Wales (Virus Watch)

Vaccine