
Olofin Daniel Oluwafemi gives a very good example of how to analyse, visualise and interpret publicly available medical data (Statistical Analysis of Genetical Disease (Breast Cancer)).

He found his input data on Kaggle, which hosted 2.2K cancer-related datasets at the time this review was prepared. Beyond cancer, the total number of ‘quality databases’, as Kaggle calls them, is 245K, and the community counts 14M machine learners who share the experience (and program code!) they gained while working with the Kaggle datasets.

Olofin’s dataset of breast cancer patients was obtained from the November 2017 update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset covers female patients with infiltrating duct and lobular carcinoma breast cancer (SEER primary sites recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumour size, unknown examined regional lymph nodes (LNs) or unknown positive regional LNs, and patients whose survival time was less than 1 month, were excluded; thus, 4024 patients were ultimately included. Read more here.

The cited article is about answering the following questions:

  • What is the percentage of patients by race?
  • What is the marital status of the patients?
  • What is the percentage of the grade level of the patient’s cancer?
  • What is the percentage of the tumor level of the cancer patient?
  • Current status of the patients
  • What is the percentage of patients’ tumor levels by their race?
  • What is the percentage of patients’ grade levels by their race?
  • Current status of the patients by grade levels
  • Current status of the patients by tumor levels
  • What is the status ratio among the patients?
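Most of the questions above reduce to two operations: the percentage share of each category, and the same share computed within subgroups (for example, status by race). The article answers them in R; the following is only a minimal Python sketch of the idea, not the author's code. The records are made-up toy values, and the field names `race` and `status` as well as the helper functions `percentage_by` and `percentage_within` are assumptions introduced for illustration.

```python
from collections import Counter

# Made-up toy records for illustration only; NOT taken from the SEER dataset.
# The field names ("race", "status") are assumptions.
patients = [
    {"race": "White", "status": "Alive"},
    {"race": "White", "status": "Dead"},
    {"race": "Black", "status": "Alive"},
    {"race": "Other", "status": "Alive"},
]


def percentage_by(records, field):
    """Return {category: percentage of records} for one categorical field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


def percentage_within(records, group_field, value_field):
    """For each group, return the percentage split of value_field inside it."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r)
    return {g: percentage_by(rows, value_field) for g, rows in groups.items()}


# Share of each race among all patients.
race_share = percentage_by(patients, "race")
# Share of each status within every race.
status_by_race = percentage_within(patients, "race", "status")

print(race_share)
print(status_by_race)
```

The same two helpers would cover the other breakdowns in the list (grade level, tumor level, status by grade) by passing different field names.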

The applied methodology was:

  • Data investigation
  • Exploratory analysis
  • Statistical analysis
  • Conclusion

While it is not mentioned in the article, visualisation was also a focus. All the work was performed with R, a free software tool. Key parts of the code are included in the article, but not the whole code, so readers with only a basic knowledge of R may be disappointed if they fail to reproduce the outcomes.

At the end of the analysis two hypotheses were also tested:

  a) There is no significant relationship between grade levels of breast cancer and status of the patients.
  b) There is no significant relationship between tumor level and status of the patients.
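Null hypotheses of this kind, stating that two categorical variables are unrelated, are commonly checked with a Pearson chi-square test of independence on a contingency table. This review does not state which test the article used, so the sketch below is an assumption about the general approach, written in Python rather than the article's R. The counts are made up for illustration, and `chi_square_independence` is a hypothetical helper name.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Standard chi-square critical values at the 5% significance level,
# keyed by degrees of freedom.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815}

# Made-up counts, NOT from the SEER data:
# rows = three grade levels, columns = (alive, dead).
observed = [
    [90, 10],
    [80, 20],
    [60, 40],
]

statistic, dof = chi_square_independence(observed)
# Reject the "no relationship" hypothesis when the statistic exceeds
# the critical value for the table's degrees of freedom.
reject_null = statistic > CRITICAL_5_PERCENT[dof]
print(statistic, dof, reject_null)
```

Comparing against a tabulated critical value keeps the sketch free of dependencies; a statistics library would report an exact p-value instead.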

The article, I guess, does not contain any revolutionary medical discovery (it turns out that both grade and tumor level significantly affect the status of the patients). But it is a very concise example of

  a) how and where to find relevant (medical) data sources
  b) how to apply free analytical tools (like R)
  c) how to address research questions and hypotheses, and finally how to answer them
  d) how to visualise outcomes, even if the outcomes of this research were easy to interpret.