- We were tasked with answering questions posed by our boss. The deliverables are a professional email, a final notebook, a README file, and a Google slide that together address at least 5 of the questions.
- Construct an email answering at least 5 questions
- Deliver a final report to the data science team
- Deliver a slide with key points
-
- Which lesson appears to attract the most traffic consistently across cohorts (per program)?
-
- Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
-
- Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
-
- Is there any suspicious activity, such as users/machines/etc. accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
-
- At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
-
- What topics are grads continuing to reference after graduation and into their jobs (for each program)?
-
- Which lessons are least accessed?
-
- Anything else I should be aware of?
- Data acquired from the Codeup Database and the provided anonymized-curriculum-access.txt document
- The two sources were merged on id (database) and cohort_id (.txt file), respectively
- The merged data contained 900,223 rows and 8 columns
- The data was acquired on 14 June 2023
- Each row represents a page access from the Codeup lesson server
- Each column represents a feature of the access event
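The merge described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the two toy DataFrames stand in for the parsed access log and the SQL pull, their values are made up, and only the column names come from the data dictionary.

```python
import pandas as pd

# Toy stand-in for the parsed access log (values are illustrative only).
access = pd.DataFrame({
    "date": ["2018-01-26", "2018-01-26", "2018-01-27"],
    "endpoint": ["/", "java-ii", "javascript-i"],
    "user_id": [1, 1, 2],
    "cohort_id": [8, 8, 0],          # 0 marks a cohort_id that was null
    "source_ip": ["192.0.2.1", "192.0.2.1", "192.0.2.2"],
})

# Toy stand-in for the cohort data pulled from the database (hypothetical row).
cohorts = pd.DataFrame({
    "id": [8],
    "name": ["Example cohort"],
    "start_date": ["2018-01-08"],
    "end_date": ["2018-05-17"],
    "program_id": [1],
})

# A left join keeps every access event, even when no cohort matches.
merged = access.merge(cohorts, how="left", left_on="cohort_id", right_on="id")
merged = merged.drop(columns="id")   # id duplicates cohort_id after the join
```

A left join is assumed here because the preparation notes mention nulls appearing in the cohort columns, which is what happens when an access row has no matching cohort.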
- Prepare data
- The only column with null values in the .txt file was cohort_id, which we filled with 0
- After adding the SQL pull to get each cohort's name, start_date, end_date, and program_id, nulls were created in those columns wherever cohort_id was 0. We filled those nulls with "Unknown cohort", 2000-01-01, 2000-01-01, and 0 respectively.
- The date column was changed to a datetime, set as the index, and the index was sorted (earliest to latest)
- No columns were removed or renamed
- No additional features were added
- No encoding or scaling was performed
- Data was not split into train/validate/test for this analysis
- Outliers were not addressed as they were part of the target
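The preparation steps above might look something like this in pandas. It is a sketch under assumptions: the toy frame and its values are invented, while the fill values and column names are the ones listed above.

```python
import pandas as pd

# Toy merged frame with the nulls a left join would leave behind.
merged = pd.DataFrame({
    "date": ["2018-01-27", "2018-01-26"],
    "endpoint": ["javascript-i", "java-ii"],
    "cohort_id": [0, 8],
    "name": [None, "Example cohort"],
    "start_date": [None, "2018-01-08"],
    "end_date": [None, "2018-05-17"],
    "program_id": [None, 1.0],
})

# Placeholder values for rows that have no matching cohort.
fill_values = {
    "name": "Unknown cohort",
    "start_date": "2000-01-01",
    "end_date": "2000-01-01",
    "program_id": 0,
}
prepared = merged.fillna(value=fill_values)
prepared["program_id"] = prepared["program_id"].astype(int)

# Convert the date-like columns, then index and sort by access date.
for col in ["date", "start_date", "end_date"]:
    prepared[col] = pd.to_datetime(prepared[col])
prepared = prepared.set_index("date").sort_index()
```

Note that the 2000-01-01 placeholder dates make every access by an unknown-cohort user look like it happened after graduation, so later date comparisons should filter those rows out.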
- Answer the following initial questions
-
- Which lesson appears to attract the most traffic consistently across cohorts (per program)?
-
- Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
-
- Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
-
- Not addressed
-
- Not addressed
-
- What topics are grads continuing to reference after graduation and into their jobs (for each program)?
-
- Which lessons are least accessed?
-
-
- The lessons with the most traffic consistently across cohorts (per program) are:
- WebDev (Pro.1): javascript-i
- WebDev (Pro.2): javascript-i
- Data Science (Pro.3): classification/overview
- Apollo cohort (Pro.4): content/html-css
- Unknown Group (Pro.0): javascript-i
-
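One way to arrive at a "most traffic consistently across cohorts" answer is to find each cohort's top lesson and then take the lesson that wins in the most cohorts per program. The sketch below assumes that definition, which may differ from the notebook's; the cohort names and hit counts are toy values.

```python
import pandas as pd

# Toy access events: (program_id, cohort name, endpoint).
rows = [
    (2, "A", "javascript-i"), (2, "A", "javascript-i"), (2, "A", "java-i"),
    (2, "B", "javascript-i"), (2, "B", "javascript-i"), (2, "B", "spring"),
    (3, "C", "classification/overview"), (3, "C", "classification/overview"),
    (3, "C", "sql/basics"),
]
df = pd.DataFrame(rows, columns=["program_id", "name", "endpoint"])

# Hits per lesson within each cohort.
counts = (
    df.groupby(["program_id", "name", "endpoint"])
    .size()
    .rename("hits")
    .reset_index()
)

# The single most-accessed lesson for each cohort.
top_per_cohort = counts.loc[counts.groupby(["program_id", "name"])["hits"].idxmax()]

# The lesson that is the top lesson in the most cohorts, per program.
consistent = top_per_cohort.groupby("program_id")["endpoint"].agg(
    lambda s: s.value_counts().idxmax()
)
```

On real data, non-lesson endpoints such as landing pages would probably need to be filtered out first; that step is not shown.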
- The Data Science program was the only program in which one cohort referred to a lesson significantly more than the other cohorts did
- The advanced-dataframes lesson was accessed a lot by Bayes, but very little by Curie and Darden
- The Timeseries explore lesson was accessed a lot by Bayes and Curie, but very little by Darden
-
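A comparison like the one above can be sketched by normalizing each cohort's traffic and flagging lessons where one cohort's share is far above the typical share. The hit counts, the `sql/basics` endpoint, and the 3x threshold are all assumptions made for illustration; only the cohort names and `advanced-dataframes` come from the findings.

```python
import pandas as pd

# Toy hit counts per (cohort, endpoint); numbers are invented.
hits = {
    ("Bayes", "advanced-dataframes"): 6, ("Bayes", "sql/basics"): 4,
    ("Curie", "advanced-dataframes"): 1, ("Curie", "sql/basics"): 9,
    ("Darden", "advanced-dataframes"): 1, ("Darden", "sql/basics"): 9,
}
rows = [(cohort, endpoint) for (cohort, endpoint), n in hits.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["name", "endpoint"])

# Share of each cohort's total traffic that went to each lesson.
shares = pd.crosstab(df["endpoint"], df["name"], normalize="columns")

# How far each cohort sits above the median share for that lesson.
ratio = shares.div(shares.median(axis=1), axis=0)

# Lessons where some cohort's share is more than 3x the median (arbitrary cutoff).
peak = ratio.max(axis=1)
standout_lessons = peak[peak > 3].index.tolist()
standout_cohorts = ratio.idxmax(axis=1)[standout_lessons].to_dict()
```

Using shares rather than raw counts keeps larger or longer-running cohorts from dominating the comparison.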
- There are 10 users in the dataset who, while active, accessed the curriculum <= 10 times
- All users were in program 2
- They were in 9 separate cohorts (2 users in the same cohort)
- Seven of the ten users accessed the curriculum on the first or second day of class only, indicating students who may have dropped out
- 3 of the 10 accessed the curriculum much later in the program
- Users 278, 812, and 832 from the Voyageurs, Hyperion, and Jupiter cohorts respectively
- No good explanation for this: could be an error in capturing the data or some sort of unauthorized access
-
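The low-access check above could be approximated as below. The sketch assumes "active" means the access date falls between the cohort's start_date and end_date, and that the frame is indexed by date as described in the preparation notes. The function name `low_access_users` and the toy rows are hypothetical.

```python
import pandas as pd

def low_access_users(df, max_hits=10):
    """Return hit counts for users with at most max_hits accesses while in class."""
    events = df.reset_index()  # bring the date index back as a column
    in_class = (events["date"] >= events["start_date"]) & (
        events["date"] <= events["end_date"]
    )
    hits = events[in_class].groupby("user_id").size()
    return hits[hits <= max_hits]

# Toy data: user 1 is a regular user, user 2 barely visits,
# user 3 only appears after graduation and so is never "active".
rows = [(d, 1) for d in pd.date_range("2021-01-04", periods=12)]
rows += [
    (pd.Timestamp("2021-01-04"), 2),
    (pd.Timestamp("2021-01-05"), 2),
    (pd.Timestamp("2021-07-01"), 3),
]
df = (
    pd.DataFrame(rows, columns=["date", "user_id"])
    .assign(start_date=pd.Timestamp("2021-01-04"), end_date=pd.Timestamp("2021-06-01"))
    .set_index("date")
)

result = low_access_users(df)
```

Users with no accesses at all during class never appear in the grouped counts, so they would need a separate check if they matter.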
- The most referenced topics after graduation are:
- Web Development - Java and Javascript
- Data Science - SQL and classification
-
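A rough way to rank post-graduation topics is to keep only accesses dated after the cohort's end_date and count them by topic. In this sketch the "topic" is taken to be the first path segment of the endpoint, which is an assumption, and `top_topics_after_graduation` is a hypothetical helper; the sub-paths and counts are toy values.

```python
import pandas as pd

def top_topics_after_graduation(df, n=1):
    """Top-n topics per program among accesses made after the cohort end date."""
    events = df.reset_index()
    # Program 0 carries placeholder dates, so it is excluded here.
    known = events[events["program_id"] != 0]
    grads = known[known["date"] > known["end_date"]].copy()
    # Assumption: the topic is the first path segment of the endpoint.
    grads["topic"] = grads["endpoint"].str.split("/").str[0]
    counts = grads.groupby(["program_id", "topic"]).size().sort_values(ascending=False)
    return counts.groupby(level="program_id").head(n)

rows = [
    ("2020-02-01", "javascript-i/loops", 2, "2020-01-01"),
    ("2020-02-02", "javascript-i/loops", 2, "2020-01-01"),
    ("2020-02-03", "java-ii/arrays", 2, "2020-01-01"),
    ("2019-10-01", "spring/setup", 2, "2020-01-01"),   # during class, ignored
    ("2020-07-01", "sql/joins", 3, "2020-06-01"),
    ("2020-07-02", "sql/joins", 3, "2020-06-01"),
    ("2020-07-03", "classification/overview", 3, "2020-06-01"),
    ("2020-07-04", "javascript-i", 0, "2000-01-01"),   # unknown cohort, ignored
]
df = pd.DataFrame(rows, columns=["date", "endpoint", "program_id", "end_date"])
df["date"] = pd.to_datetime(df["date"])
df["end_date"] = pd.to_datetime(df["end_date"])
df = df.set_index("date")

result = top_topics_after_graduation(df, n=1)
```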
- The least-accessed lessons are a collection of 457 lesson pages that were each accessed only once.
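Pages accessed only once can be pulled out with a simple count of endpoint values. The endpoints below are toy values, not pages from the real log.

```python
import pandas as pd

# Toy endpoint column from the access log.
endpoints = pd.Series(
    ["javascript-i", "javascript-i", "sql/basics",
     "appendix/old-page", "sql/basics", "misc/typo-url"],
    name="endpoint",
)

# Count hits per page and keep the pages seen exactly once.
hits = endpoints.value_counts()
accessed_once = sorted(hits[hits == 1].index)
```

Some single-hit endpoints in a real log may be mistyped URLs rather than genuine lessons, so the list is worth a manual look before drawing conclusions.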
Feature | Datatype | Key | Definition |
---|---|---|---|
date | datetime64 | YYYY-MM-DD | Date of activity; Index |
endpoint | object | unique | In the url, everything after "...codeup.com/" |
user_id | int64 | unique # | Unique ID # assigned to user |
cohort_id | int64 | unique # | Unique ID # assigned to cohort |
source_ip | object | IP ##.###.##.## | Unique IP address assigned to user device |
name | object | unique | Name assigned to cohort |
start_date | datetime64 | YYYY-MM-DD | Date cohort started |
end_date | datetime64 | YYYY-MM-DD | Date cohort graduated |
program_id | int64 | 1,2,3,4 | Program identifier: 1,2,4 - Webdev; 3 - Data Science |
-
- Data acquired from the Codeup Database and the provided anonymized-curriculum-access.txt document
- The two sources were merged on id (database) and cohort_id (.txt file), respectively
- Ensure you have your .env with credentials in the same folder
- Ensure you have your anonymized-curriculum-access.txt document in the same folder
-
- Clone this repo.
-
- Put the anonymized-curriculum-access.txt file and your .env file containing credentials into the project folder that contains the cloned repo.
-
- Run notebook.
-
Which lesson appears to attract the most traffic consistently across cohorts (per program)?
- The lessons with the most traffic consistently across cohorts (per program) are:
- WebDev Programs 1, 2 and unassigned: javascript-i
- Data Science Program 3: classification overview
-
Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
- The Data Science program was the only program in which one cohort referred to a lesson significantly more than the other cohorts did
- The advanced-dataframes lesson was accessed a lot by Bayes, but very little by Curie and Darden
- The Timeseries explore lesson was accessed a lot by Bayes and Curie, but very little by Darden
-
Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
- There are 10 users in the dataset who, while active, accessed the curriculum <= 10 times
- All users were in program 2
- They were in 9 separate cohorts (2 users in the same cohort)
- Seven of the ten users accessed the curriculum on the first or second day of class only, indicating students who may have dropped out
- 3 of the 10 accessed the curriculum much later in the program
- Users 278, 812, and 832 from the Voyageurs, Hyperion, and Jupiter cohorts respectively
- No good explanation for this: could be an error in capturing the data or some sort of unauthorized access
-
What topics are grads continuing to reference after graduation and into their jobs (for each program)?
- The most referenced topics after graduation are:
- Web Development - Java and Javascript
- Data Science - SQL and classification
-
Which lessons are least accessed?
- The least-accessed lessons are a collection of 457 lesson pages that were each accessed only once
- Provide additional takeaways or downloadable docs for extensively used topics
- Investigate the need to redo or reorganize the information on the 457 seldom used pages
-
- With more time, we could have looked further into the two unaddressed questions and connected unknown users to cohorts