edaaydinea/OP1-Prediction-of-the-Different-Progressive-Levels-of-Alzheimer-s-Disease

OP1 - Prediction of the Different Progressive Levels of Alzheimer's Disease

A. Business Understanding - Project Objective

  • This is an optional model development project on a real dataset related to predicting the different progressive levels of Alzheimer's disease (AD). Students are expected to use the TensorFlow library for modeling and to submit predicted labels for a test dataset, on which their score will be evaluated objectively.
  • This project is included in the UpSchool - Google Developers Machine Learning - Deep Learning Program.
  • In this project, you are expected to build a data science model that determines the level of Alzheimer's disease. The levels are ordinal categories, from lowest to highest: 0, 0.25, 0.50, 1.0, 2.0, 3.0 (the progressive levels of Alzheimer's disease).
  • You are expected to use the following features:

['EDUC','NACCMOCA','MARISTAT','NACCFAM','NACCGDS','NACCNE4S','NACCAPOE', 'INDEPEND','RESIDENC','ANYMEDS','NACCAMD','DEL','HALL','DEPD','ANX','APA','DISN', 'IRR','MOT','AGIT','ELAT','NITE','APP','DROPACT','NACCAGEB','SEX']
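
Because the levels are ordinal categories rather than free-form numbers, one common way to feed them to a classifier is to map each level to an integer class index. The following is a minimal sketch of that idea, not the project's actual preprocessing. It assumes each record is a plain dictionary and that the target column is named `CDRGLOB` (the name is taken from the data dictionary in this README; its use as the label column is an assumption). The helper `split_row` is hypothetical.

```python
# Ordinal levels as listed in the project brief, lowest to highest.
LEVELS = [0.0, 0.25, 0.5, 1.0, 2.0, 3.0]
LEVEL_TO_INDEX = {level: index for index, level in enumerate(LEVELS)}

# Feature names copied from the list above.
FEATURES = [
    'EDUC', 'NACCMOCA', 'MARISTAT', 'NACCFAM', 'NACCGDS', 'NACCNE4S', 'NACCAPOE',
    'INDEPEND', 'RESIDENC', 'ANYMEDS', 'NACCAMD', 'DEL', 'HALL', 'DEPD', 'ANX',
    'APA', 'DISN', 'IRR', 'MOT', 'AGIT', 'ELAT', 'NITE', 'APP', 'DROPACT',
    'NACCAGEB', 'SEX',
]


def split_row(row, target="CDRGLOB"):
    """Return (feature vector, class index) for one record given as a dict.

    The target column name is an assumption made for this sketch.
    """
    x = [float(row[name]) for name in FEATURES]
    y = LEVEL_TO_INDEX[float(row[target])]
    return x, y
```

Integer class indices of this kind can be used directly with a sparse categorical loss, or expanded to one-hot vectors if the model expects them.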

B. Data Understanding

Index Variable Name Section Variable type Data type Short Descriptor Data Source Allowable codes Missing Codes Description / derivation
1 SEX A1 - Subject Demographics Original UDS question Numeric cross-sectional Subject's sex rdd 1 = Male
2 = Female
2 EDUC A1 - Subject Demographics Original UDS question Numeric cross-sectional Years of education rdd 0 - 36
99 = Unknown
In general,
12 = high school or GED,
16 = bachelor's degree,
18 = master's degree,
20 = doctorate.
Note that although this variable is not collected at follow-up visits, the value from the initial visit will be shown at all follow-up visits.
3 MARISTAT A1 - Subject Demographics Original UDS question Numeric longitudinal Marital Status rdd 1 = Married
2 = Widowed
3 = Divorced
4 = Separated
5 = Never married (or marriage was annulled)
6 = Living as married/domestic partner
9 = Other or unknown
Note that in v1–2 there was an option for “other” status. These have been recoded to maristat = 9.
4 INDEPEND A1 - Subject Demographics Original UDS question Numeric longitudinal Level of independence rdd 1 = Able to live independently
2 = Requires some assistance with complex activities
3 = Requires some assistance with basic activities
4 = Completely dependent
9 = Unknown
5 RESIDENC A1 - Subject Demographics Original UDS question Numeric longitudinal Type of residence rdd 1 = Single- or multi-family private residence
(apartment, condo, house)
2 = Retirement community or independent group living
3 = Assisted living, adult family home, or boarding home
4 = Skilled nursing facility, nursing home, hospital, or hospice
9 = Other or unknown
Note that in v1–2 there was an option for “other” type of residence. These have been recoded to residenc = 9.
6 NACCAGEB A1 - Subject Demographics NACC derived variable Numeric cross-sectional Subject's age at initial visit rdd 18 - 120 Birth month and year are required elements in the UDS; however, birth day is not collected. To calculate naccageb, birth day is set to 1 for all subjects. Baseline age is then computed as initial visit date minus birth date. Note that although this variable is listed for all visits, it does not change across visits; it is cross-sectional.
7 NACCFAM A3 - Subject Family History NACC derived variable Numeric cross-sectional Indicator of first-degree family member with cognitive impairment rdd 0 = No report of a first-degree family member with cognitive impairment
1 = Report of at least one first-degree family member with cognitive impairment
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
UDS Form A3 version 1 – 2, submitted at all available visits: Subjects reporting at least one parent, sibling, or child with dementia at any visit will have naccfam = 1. Subjects who report no first-degree family members with dementia at all visits where Form A3 is submitted will have naccfam = 0.
UDS Form A3 version 3.0 or subsequent versions, submitted at all available visits: If at least one parent, sibling, or child is reported to have both a primary neurological problem/psychiatric condition of cognitive impairment/behavior change (coded as 1) and one of the primary diagnosis codes listed below at any visit, then naccfam = 1. Subjects who report all first-degree family members as having a family history absent of cognitive impairment/psychiatric condition (primary neurological problem/psychiatric condition coded as 2–8) or a primary neurological problem/psychiatric condition is reported (coded as 1), but a code other than those listed below is reported, will have naccfam = 0.
For subjects with Form A3 data from multiple form versions, all available data will be included in the calculation of naccfam. For example, if a family history of cognitive impairment is indicated on Form A3 using v3.0 but not on a previous version using v1–2, the subject will still have naccfam = 1.
Those with a submitted Form A3 (any version) who are missing data on all first-degree family members are coded as Unknown (naccfam = 9). If some first-degree family members are coded as No and some are coded as Unknown, then they are all coded as Unknown (naccfam = 9).
In general, a known history of cognitive impairment reported at any visit supersedes all visits with missing codes. Likewise, an indication of cognitive impairment at any visit supersedes all other visits where a history of cognitive impairment is indicated as not present. In all other conditions where reporting varies, data from the most recent visit are used to calculate naccfam.
If Form A3 was never submitted for any version of the UDS, naccfam will take a value of -4. Note that although this variable is listed for all visits, it does not change across visits; it is cross-sectional.
8 ANYMEDS A4 - Subject Medications Original UDS question Numeric longitudinal Subject taking any medications rdd 0 = No
1 = Yes
-4 = Did not complete medications form
If the medications form was not completed, then anymeds = -4.
9 NACCAMD A4 - Subject Medications NACC derived variable Numeric longitudinal Total number of medications reported at each visit rdd 0 - 40
-4 = Did not complete medications form
This variable provides the total number of medications reported at a visit including all prescription and over the counter medications reported on UDS Form A4 at a single visit. If the medications form was not completed, then naccamd = -4.
10 CDRGLOB B4 CDR® Plus NACC FTLD Original UDS question Numeric longitudinal Global CDR® rdd 0.0 = No impairment
0.5 = Questionable impairment
1.0 = Mild impairment
2.0 = Moderate impairment
3.0 = Severe impairment
11 DEL B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Delusions in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (del=9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
12 HALL B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Hallucinations in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (hall = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
13 AGIT B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Agitation or aggression in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (agit = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
14 DEPD B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Depression or dysphoria in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (depd = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
15 ANX B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Anxiety in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (anx = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
16 ELAT B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Elation or euphoria in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (elat = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
17 APA B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Apathy or indifference in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (apa = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
18 DISN B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Disinhibition in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (disn = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
19 IRR B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Irritability or lability in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (irr = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
20 MOT B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Motor disturbance in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (mot = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
21 NITE B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Nighttime behaviors in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (nite = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
22 APP B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) Original UDS question Numeric longitudinal Appetite and eating problems in the last month rdd 0 = No
1 = Yes
9 = Unknown
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
An option of Unknown (app = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question.
23 NACCGDS B6 Geriatric Depression Scale (GDS) NACC derived variable Numeric longitudinal Total GDS Score rdd 0 - 15
88 = Could not be calculated
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
In earlier versions of the UDS, Centers were not given instructions on how to calculate the total GDS score if three or fewer GDS items were missing. NACC has created a new derived variable for Total GDS score so that subjects who were given the GDS in the earlier versions of UDS v1 will have a total GDS score if they skipped three or fewer items on the questionnaire. If the subject was missing more than three of the 15 items on the GDS for any UDS version, naccgds = 88. The UDS Coding Guidebook for Form B6 provides the algorithm for calculating the GDS score when three or fewer items are missing.
24 DROPACT B6 Geriatric Depression Scale (GDS) Original UDS question Numeric longitudinal Have you dropped many of your activities and interests? rdd 0 = No
1 = Yes
9 = Did not answer
-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question
Note that an option of 9 = Did not answer was added to UDS v3.0 and subsequent versions.
25 NACCAPOE NACC derived variable Numeric cross-sectional APOE genotype rdd-genetic 1 = e3,e3
2 = e3,e4
3 = e3,e2
4 = e4,e4
5 = e4,e2
6 = e2,e2
9 = Missing/ unknown/ not assessed
APOE genotype is run independently by the ADC and reported to NACC on the NACC Neuropathology Form. APOE genotype is also reported by ADGC and NCRAD. In the rare case that the ADC-reported genotype and the genotype reported by ADGC are not the same, the genotype is set to 9 = Missing for that subject.
26 NACCNE4S NACC derived variable Numeric cross-sectional Number of APOE e4 alleles rdd-genetic 0 = No e4 allele
1 = 1 copy of e4 allele
2 = 2 copies of e4 allele
9 = Missing/ unknown/ not assessed
APOE genotype is run independently by the ADC and reported to NACC on the NACC Neuropathology Form. APOE genotype is also reported by ADGC and NCRAD. In the rare case that the ADC-reported genotype and the genotype reported by ADGC are not the same, the genotype is set to 9 = Missing for that subject.
  • The shape of the dataset is (9180, 38).
    • There are 9180 observations and 38 variables.
  • There are no missing values in the dataset.
  • Of the 38 variables, 32 contain categorical data (all 32 are nominal) and 6 contain numerical data.
    • Categorical column names: ['NACCFAM', 'NACCNE4S', 'ANYMEDS', 'DEL', 'HALL', 'DEPD', 'ANX', 'APA', 'DISN', 'IRR', 'MOT', 'AGIT', 'ELAT', 'NITE', 'APP', 'DROPACT', 'SEX', 'MARISTAT_1', 'MARISTAT_2', 'MARISTAT_3', 'MARISTAT_4', 'MARISTAT_5', 'MARISTAT_6', 'INDEPEND_1', 'INDEPEND_2', 'INDEPEND_3', 'INDEPEND_4', 'RESIDENC_1', 'RESIDENC_2', 'RESIDENC_3', 'RESIDENC_4', 'CDRGLOB']
    • Numerical column names: ['EDUC', 'NACCMOCA', 'NACCGDS', 'NACCAPOE', 'NACCAMD', 'NACCAGEB']
    • Nominal column names: ['NACCFAM', 'NACCNE4S', 'ANYMEDS', 'DEL', 'HALL', 'DEPD', 'ANX', 'APA', 'DISN', 'IRR', 'MOT', 'AGIT', 'ELAT', 'NITE', 'APP', 'DROPACT', 'SEX', 'MARISTAT_1', 'MARISTAT_2', 'MARISTAT_3', 'MARISTAT_4', 'MARISTAT_5', 'MARISTAT_6', 'INDEPEND_1', 'INDEPEND_2', 'INDEPEND_3', 'INDEPEND_4', 'RESIDENC_1', 'RESIDENC_2', 'RESIDENC_3', 'RESIDENC_4', 'CDRGLOB']

Histogram of Binary Target Categories (Before SMOTE Oversampling)

C. Data Analysis

  • No rows were dropped from the data.
  • Outlier thresholds were derived from the 0.25 and 0.75 quantiles. Values beyond the lower or upper threshold were treated as outliers and set equal to the corresponding threshold value.
  • There was no missing data.
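
The quantile-based capping described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the README gives only the 0.25 and 0.75 quantiles, so the conventional multiplier `k = 1.5` on the interquartile range is an assumption, and the helper names `iqr_thresholds` and `cap_outliers` are hypothetical. The sketch uses only the standard library so it runs without dependencies.

```python
import statistics


def iqr_thresholds(values, k=1.5):
    """Return (lower, upper) thresholds from the 0.25 and 0.75 quantiles.

    k = 1.5 is an assumed multiplier; the README does not state the value used.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clamp every value into the [lower, upper] threshold range."""
    lower, upper = iqr_thresholds(values, k)
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(cap_outliers(ages))
```

With pandas, the same effect is usually obtained with `Series.quantile` to compute the thresholds and `Series.clip` to apply them.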

  • In both male and female patients, anxiety, depression, irritability and apathy were observed to be associated with moderate impairment.
  • A chi-square test was performed for the nominal variables. The variables with a p-value greater than 0.5 (['naccfam', 'maristat_4', 'maristat_6']) were excluded from the model.
  • An ANOVA test was performed for the numerical variables. None of them had a p-value greater than 0.5.
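
To make the chi-square screening step concrete, here is a dependency-free sketch of a test of independence between one nominal feature and the target, followed by a filter on the p-value. It is an illustration, not the project's implementation: the function names (`chi2_sf`, `contingency_table`, `chi_square_test`, `weak_features`) are hypothetical, and the 0.5 cut-off is simply the value quoted in this README. Unlike `scipy.stats.chi2_contingency`, this sketch applies no continuity correction to 2x2 tables.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-square variable, integer dof >= 1."""
    y = x / 2.0
    # Closed-form starting points: Q(1, y) = exp(-y), Q(1/2, y) = erfc(sqrt(y)).
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def contingency_table(feature, target):
    """Count co-occurrences of feature values (rows) and target values (columns)."""
    counts = Counter(zip(feature, target))
    rows, cols = sorted(set(feature)), sorted(set(target))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_square_test(table):
    """Return (statistic, p_value) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, chi2_sf(stat, dof)


def weak_features(columns, target, threshold=0.5):
    """Names of features whose p-value exceeds the threshold (candidates to drop)."""
    weak = []
    for name, values in columns.items():
        _, p_value = chi_square_test(contingency_table(values, target))
        if p_value > threshold:
            weak.append(name)
    return weak
```

A feature returned by `weak_features` shows little evidence of association with the target under this test, which matches the reason given above for excluding variables from the model.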

D. Feature Engineering

  • Label encoding was evaluated, but no column was found to require it.
  • One-hot encoding was evaluated, and two features were found to require it (['NACCNE4S', 'NACCAPOE']).
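
As a small illustration of one-hot encoding for a coded feature such as NACCNE4S, the sketch below expands a list of codes into one indicator column per distinct code. The helper `one_hot` is hypothetical, and the `PREFIX_value` column naming is an assumption chosen to match names like `MARISTAT_1` that appear in the column lists in this README.

```python
def one_hot(values, prefix):
    """Expand a list of category codes into 0/1 indicator columns.

    Returns a dict mapping "<prefix>_<code>" to a list of indicators,
    one entry per input value, with codes in sorted order.
    """
    categories = sorted(set(values))
    return {
        f"{prefix}_{code}": [1 if value == code else 0 for value in values]
        for code in categories
    }


if __name__ == "__main__":
    # Toy codes for the number of APOE e4 alleles (0, 1 or 2).
    print(one_hot([0, 1, 2, 1], "NACCNE4S"))
```

With pandas, `pandas.get_dummies` provides the same expansion for DataFrame columns.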

E. Modeling

  • The class imbalance in the training data was removed with SMOTE oversampling before the models were trained.
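
The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest neighbours from the same class. The sketch below shows that idea in plain Python. It is a simplified illustration, not the project's code and not a replacement for a library implementation such as `imblearn.over_sampling.SMOTE`. The function name `smote`, the default `k=5`, and the fixed seed are assumptions made for the example.

```python
import math
import random
from collections import Counter


def smote(samples, labels, k=5, seed=0):
    """Oversample every minority class up to the size of the largest class.

    samples: list of numeric feature lists; labels: list of class labels.
    Returns new (samples, labels) lists with synthetic rows appended.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for cls, count in counts.items():
        members = [s for s, l in zip(samples, labels) if l == cls]
        if len(members) < 2:
            continue  # no neighbour to interpolate with
        for _ in range(target_size - count):
            base = rng.choice(members)
            # k nearest same-class neighbours of the chosen base sample.
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            new_samples.append([b + gap * (o - b) for b, o in zip(base, other)])
            new_labels.append(cls)
    return new_samples, new_labels
```

Oversampling of this kind should be applied to the training split only, as stated above, so that validation and test data keep their original class distribution.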

Histogram of Binary Target Categories (After SMOTE Oversampling)

  • Modeling was carried out in two stages:
    • Baseline Model
    • Estimator / Classifier Selection (Hyperparameter Tuning)

Comparison of Validation Accuracy Result - Validation Tuned Accuracy Result

Comparison of Test Accuracy Result - Test Tuned Accuracy Result