Base paper Reference: https://arxiv.org/abs/2105.08321
Blog written by: Sethu (Myself), Ramya (Ramya B) and Rohan
Hi there! One of the most painstaking tasks for an early-career researcher is identifying research gaps in existing work. At the PathCheck Foundation, we took a stab at the COVID-19 Symptom Data Challenge organized by Facebook, CMU, and UMD. The challenge was to estimate COVID-19 prevalence (the percentage of the population testing positive) from self-reported symptoms, in a non-intrusive manner, to accelerate testing and quarantine, especially in low-resource settings. The first step was to identify what had already been done in this space and then iteratively improve upon it to solve the problem at hand. To do this, we followed a fixed format for identifying the gaps, or the possible novelty, that could be brought into the current research.
A rough introduction to the problem we tried solving:
The rapid progression of the COVID-19 pandemic has provoked large-scale data collection efforts on an international level to study the epidemiology of the virus and inform policies. Various studies have been undertaken to predict the spread, severity, and unique characteristics of the COVID-19 infection across a broad range of clinical, imaging, and population-level datasets. Despite this, the pandemic continues to challenge medical systems worldwide in many ways, including sharp increases in demand for hospital beds and critical shortages of medical equipment. Compounding this, many healthcare workers have themselves been infected, which hampers the capacity for immediate clinical decisions and the effective use of healthcare resources.
The most validated diagnostic test for COVID-19, using reverse transcriptase-polymerase chain reaction (RT-PCR), has long been lacking in developing countries. This contributes to increased infection rates and delays in critical preventive measures. So, effective screening enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems.
One such screening method leverages self-reported symptoms and checks whether they are a good indicator of the likelihood of COVID-19. For instance, anosmia was shown to be the strongest predictor of disease presence (in late September), and a model for disease detection using symptom-based predictors was reported to have a sensitivity of about 65%. From the onset of COVID-19, there has also been a significant amount of work in mathematical modeling to understand the outbreak under different situations and for different demographics. However, these works primarily focus on the population level. Furthermore, estimating the transition probabilities for moving between compartments is challenging.
So there is a need to understand trends in the spread of COVID-19 by utilizing the results of self-reported COVID-19 symptoms surveys as an alternative to COVID-19 testing reports. This allows us to assess community disease prevalence, even in areas with low COVID-19 testing ability. Using individually reported symptom data from various populations, we predicted the likely percentage of the population that tested positive for COVID-19.
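To make this a bit more concrete, here is a minimal sketch of how individual survey responses might be rolled up into region-level symptom percentages, the kind of feature a prevalence model could then consume. The field layout and sample rows are hypothetical; this is neither the challenge dataset schema nor the model from our paper.

```python
from collections import defaultdict

# Hypothetical survey rows: (region, reported_anosmia, reported_fever).
# 1 means the respondent reported the symptom, 0 means they did not.
responses = [
    ("A", 1, 0), ("A", 0, 1), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 0), ("B", 0, 1),
]

def symptom_rates(rows):
    """Return, per region, the percentage of respondents reporting each symptom."""
    totals = defaultdict(lambda: {"n": 0, "anosmia": 0, "fever": 0})
    for region, anosmia, fever in rows:
        totals[region]["n"] += 1
        totals[region]["anosmia"] += anosmia
        totals[region]["fever"] += fever
    return {
        region: {
            "pct_anosmia": 100.0 * t["anosmia"] / t["n"],
            "pct_fever": 100.0 * t["fever"] / t["n"],
        }
        for region, t in totals.items()
    }

rates = symptom_rates(responses)
print(rates)
```

Region-level percentages like these would still need to be mapped to test-positivity by a fitted model; the sketch only covers the aggregation step.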
Given the importance of the problem, we took a stab at it, and this is a sample of the literature survey method we followed.
For each paper, we note:
1. Pros (helps us incorporate the learnings from their strengths)
2. Cons (helps us proactively prepare for potential pitfalls and understand well-known challenges and limitations)
3. Future Work (a direct giveaway for where to start exploring a new research topic)
Machine learning-based prediction of COVID-19 diagnosis based on symptoms - Nature (Paper link)
- Large high-quality dataset to get an effective understanding of the disease dynamics. (51831 tested people)
- Training is carried out using unbiased features.
- Most previous models were based on data from hospitalized patients and are thus not effective for screening for SARS-CoV-2 in the general population. This work tries to address this challenge.
- Bias and missing information in many features were not handled effectively. For example, for patients labeled as having had contact with a person confirmed to have COVID-19, additional information such as the contact’s duration and location (indoors/outdoors) was not available. Previous studies identified some symptoms (such as loss of smell and taste) as being very predictive of a COVID-19 infection, but these were not recorded by the Israeli Ministry of Health.
- Note that all the symptoms were self-reported, and a negative value for a symptom might simply mean that the symptom was not reported. It is therefore essential to assess the model’s performance when more values are unreported or missing rather than truly negative. The authors did not address this.
- The results are hard to interpret, and the boosting algorithm used is sensitive to outliers, since every classifier must correct the errors of its predecessors. This also makes the method prone to overfitting. Another disadvantage is that training is hard to parallelize, because every estimator depends on the previous ones, which makes the procedure difficult to scale up.
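The concern about unreported symptoms being recorded as negatives can be probed with a small stress test: randomly drop reported symptoms and watch how sensitivity changes. The sketch below is only an illustration. The threshold rule is a toy stand-in for a real classifier (not the paper's boosting model), and the records and labels are made up.

```python
import random

# Hypothetical data: each record is a list of 0/1 symptom flags,
# and each label is 1 for a confirmed case, 0 otherwise.
records = [[1, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0],
           [0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
labels = [1, 1, 1, 0, 0, 1]

def predict(record, threshold=2):
    """Toy stand-in classifier: flag a case when enough symptoms are reported."""
    return sum(record) >= threshold

def drop_reports(record, drop_rate, rng):
    """Simulate under-reporting: each reported symptom is lost with probability drop_rate."""
    return [0 if v == 1 and rng.random() < drop_rate else v for v in record]

def sensitivity_under_dropout(records, labels, drop_rate, seed=0):
    """Sensitivity (true positive rate) after simulated under-reporting."""
    rng = random.Random(seed)
    tp = fn = 0
    for record, label in zip(records, labels):
        if label != 1:
            continue  # sensitivity only looks at confirmed cases
        if predict(drop_reports(record, drop_rate, rng)):
            tp += 1
        else:
            fn += 1
    return tp / (tp + fn)

for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, sensitivity_under_dropout(records, labels, rate))
```

With a real model, the same loop would show how quickly performance degrades as more symptoms go unreported, which is the evaluation we felt was missing.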
Future research directions:
- In parallel to increasing understanding of the contribution of various symptoms to diagnosing the disease, additional symptoms might be integrated into future models.
- The authors could try deep learning models and other interpretable ML approaches to understand the non-linearities in the data.
- The evaluation could be extended by incorporating data from other countries, variants, vaccinations, etc.
A machine learning-based exploration of COVID-19 mortality risk — PLOS (Paper link)
- Both invasive and non-invasive features were considered.
- This is one of the first works to explore the predictive power of invasive and non-invasive features. Evaluating invasive biomarkers provides more direct and causal inferences about our physiological state. In contrast, non-invasive features contain broader, indirect information about the body.
- Explainable ML models were used to estimate the COVID-19 mortality risk.
- The mean age is 62, so younger people are under-represented and the analysis may not generalize to the younger population.
- The data gathering interval of this study encompassed the first pandemic wave, and medical records were documented in haste, as high patient loads and limited medical staff forced the medical system to prioritize patient treatment. Therefore, many patients had incomplete medical profiles and were excluded before the data inspection phase. These factors limited the sample size of the study.
- The Massih Daneshvari Hospital had more severe and deceased patients since it was a primary care center for COVID-19. Thus, this study’s severity and mortality rates do not reflect the population rates of these variables, which could add confounding effects to the study.
Future research directions:
- The analysis does not consider how predictive features vary across different demographics and variants. Larger and more diverse study populations can be used to further evaluate the results.
- Future researchers can compare the predictive power of imaging features with laboratory and non-invasive features.
- Future studies can focus on individual groups of comorbidities (e.g., cardiovascular) and additional features to develop separate, more specific prognostic models.
A machine learning model to identify early-stage symptoms of SARS-Cov-2 infected patients — Elsevier Public Health Emergency Collection (Paper link)
- Extraction of features from unstructured raw data (hospitalized patient information in text format) using string matching algorithms, and use of this data to construct a processed dataset.
- Identification of the significant symptoms of COVID-19 patients by analyzing their association using five different machine learning approaches.
- Various age group-wise analyses are presented, which is insightful in understanding the variation in symptoms under different cohorts.
- Statistical significance was not calculated.
- The features are invasive and not self-reported, which limits the usage of this study.
- The data comes from hospitals, which acts as a bottleneck when one has to generalize to the general population.
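The string-matching extraction this paper describes can be pictured roughly as follows. This is a minimal sketch under our own assumptions: the keyword vocabulary and the sample note are hypothetical, and the paper's actual matching rules are not reproduced here.

```python
import re

# Hypothetical symptom vocabulary; the paper's real keyword list is not given here.
SYMPTOM_KEYWORDS = {
    "fever": ["fever", "high temperature"],
    "cough": ["cough"],
    "anosmia": ["loss of smell", "anosmia"],
}

def extract_symptoms(note):
    """Turn a free-text patient note into 0/1 symptom features by keyword matching."""
    text = note.lower()
    features = {}
    for symptom, keywords in SYMPTOM_KEYWORDS.items():
        # Match any keyword at a word start, so "coughing" also counts as "cough".
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")"
        features[symptom] = int(re.search(pattern, text) is not None)
    return features

note = "Patient reports high temperature and dry coughing for 3 days"
print(extract_symptoms(note))
```

Note that plain keyword matching does not handle negation ("no fever" would still be flagged as fever), so a real pipeline would need extra rules for that.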
Future research directions:
- The size of the COVID-19 dataset was probably not extensive enough to give enough statistical power to resolve the above issues. Hence, the dataset could be increased.
- Exploration of deep learning models could be an effective way to understand non-linearities.
- Understanding the patients’ past medical conditions and how they relate to the effects of COVID could support causal inference.
Individual-Level Fatality Prediction of COVID-19 Patients Using AI Methods (Paper link)
- Looks at hyper-personalized fatality prediction.
- Compares deep learning methods with classical machine learning methods and highlights where each has the advantage over the other.
- Exhaustive set of features, including various demographic attributes.
- The most profound limitation is the lack of quality data for training the models. The Wolfram dataset used to train the prediction model consisted of only 1,448 cases in a centralized area. The larger GitHub dataset contained more data points but less specific information on each case, limiting the potential prediction capability of the models.
- The dataset was still based on medical records, which in turn limits generalizability.
- Additionally, the study did not consider whether patients had received hospital care for COVID-19 treatment prior to their outcome.
Future research directions:
- Since COVID-19 fatality rates are heterogeneous depending on the region, as indicated by the Center for Evidence-Based Medicine, additional studies with more representative data would be beneficial.
- In the future, a model should be created that not only predicts death but can also predict the severity of the progression of the disease. This would prompt individuals to seek care promptly, which could prevent the debilitating after-effects that the disease might otherwise cause and keep many people from being admitted to the ICU.
- By incorporating demographic information, health habits (physical exercise), psychological factors, occupation, symptoms, and chronic diseases of the confirmed cases, predictions can be made for the number of required hospitalizations in a given area with the trained model. (combined with their dataset)
Development of a classifier with analysis of feature selection methods for COVID-19 diagnosis (Paper link)
- Reported clinical symptoms, patient-reported symptoms, and medical history were all considered.
- 111 attributes are considered.
- High reported accuracy of 98.7%, sensitivity of 96.76%, specificity of 98.80%, and AUC of 92%.
- No explainability (for either feature elimination or results).
- Relies on highly invasive features such as blood tests.
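Since accuracy, sensitivity, specificity, and AUC come up in almost every paper above, here is a minimal sketch of how they are computed from labels and model scores. The labels, scores, and the 0.5 threshold are made-up illustration values, not numbers from any of the papers.

```python
# Hypothetical ground truth (1 = COVID-19 positive) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), specificity (recall on negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(classification_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, so a model can score very high on the first three at one threshold and still have a lower AUC.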