machine learning jupyter notebooks | data-science | priority | relevant | significant | green-light | 1 | may-2023-filtered | may-2023-filtered-2 | may-2023-filtered-3 | filtered-4

CoderSales/machine-learning-classification

machine-learning-classification

primary source for this README: jupyter-6-Supervised-Learning

Repository for running jupyter notebooks and keeping relevant files in one place

Updates from

ML-logistic-regression-notes

All content below this point from documentation repository:

documentation

documentation for different repositories

assembling:

part 1:

[closed], W., Wencel, W. and Agrawal, S. (2016) What is the difference between a feature and a label?, Stack Overflow. Available at: https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label#:~:text=Briefly%2C%20feature%20is%20input%3B%20label,region%2C%20family%20income%2C%20etc. (Accessed: 9 February 2023).

part 2: Repos used to compile this README.md :

ML-logistic-regression-notes

machine-learning-classification

part 3: README from Repo 1

ML-logistic-regression-notes

files

QUICKSTART-WIN-VSC-BASH.md from

primary source for this README: machine-learning-classification

Repository for running jupyter notebooks and keeping relevant files in one place

secondary source for this README: jupyter-6-Supervised-Learning

Repository for running jupyter notebooks and keeping relevant files in one place

notes

notes made for previous plan to remove null values

check how to remove null values from dataframe

notes

pandas: .iloc[] - select by integer row and column positions; .loc[] - select by row index label and column NAME. Both are indexers, so they take square brackets, not parentheses.
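A minimal sketch of the difference between the two indexers. The toy data frame and its column names are made up for illustration; the row labels are strings so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical toy data; not taken from any dataset in this repository.
df = pd.DataFrame(
    {"age": [31, 45, 52], "glucose": [85, 168, 110]},
    index=["a", "b", "c"],
)

# .iloc[] works on integer positions: second row, first column.
by_position = df.iloc[1, 0]

# .loc[] works on labels: row labelled "b", column named "age".
by_label = df.loc["b", "age"]

# Slices behave differently: .iloc[] excludes the end position,
# while .loc[] includes the end label.
rows_by_position = df.iloc[0:2]
rows_by_label = df.loc["a":"b"]

print(by_position, by_label)
print(rows_by_position)
print(rows_by_label)
```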

how to run python files from terminal

Data Cleaning

2.13 Lecture

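df.drop('Column name', axis=1) - where axis = 0 for rows, 1 for columns - drops the referenced column from the data frame. The call returns a new data frame; assign the result back, or pass the inplace=True argument, to ensure the column stays dropped.

df.drop(1, axis=0).reset_index() - drops the row with index label 1 and resets the index, adding a new col with the old indices.

df.drop(1, axis=0).reset_index(drop=True) - same, but discards the old indices. Assign the result back to keep it: adding inplace=True to this chained call would only modify the temporary frame returned by df.drop and return None, leaving df unchanged.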
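The drop and reset_index calls above can be sketched as follows. The data frame and its column names are hypothetical; only the pandas calls come from the notes.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame(
    {"Rooms": [2, 3, 4], "Price": [100, 200, 300], "Notes": ["x", "y", "z"]}
)

# axis=1 drops a column; a new frame is returned and df is left unchanged.
no_notes = df.drop("Notes", axis=1)

# axis=0 drops a row by index label; the remaining index is 0, 2.
without_row = df.drop(1, axis=0)

# reset_index() keeps the old labels in a new column named "index".
kept_old = without_row.reset_index()

# reset_index(drop=True) discards the old labels instead.
clean = without_row.reset_index(drop=True)

# inplace=True modifies df itself and returns None.
df.drop("Notes", axis=1, inplace=True)

print(kept_old)
print(clean)
```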

df.copy

4.1 Lecture Data Sanity Checks - Part 1

df['columnname'].apply(type).value_counts() - looks up the Python type of every value in the column and counts how many values there are of each type

df['colname'] = df['colname'].replace(['missing','inf'],np.nan) - replaces our specified strings 'missing' and 'inf' - with np.nan

df['colname'] = df['colname'].astype(float) - convert values to float

Review note: np.nan is itself a float, so when we substitute np.nan in for the strings, every value in the column is a float (provided all the other entries are floats) and the column can be converted with astype(float).
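A minimal sketch of the sanity check and clean-up described above, run on a small made-up column rather than the real Melbourne_Housing.csv file. The values are assumptions chosen to mimic a numeric column polluted with the strings 'missing' and 'inf'.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing floats with placeholder strings.
df = pd.DataFrame({"BuildingArea": [120.0, "missing", 95.5, "inf", 80.0]})

# Count how many values of each Python type the column holds.
type_counts_before = df["BuildingArea"].apply(type).value_counts()
print(type_counts_before)

# Replace the placeholder strings with np.nan, then convert to float.
df["BuildingArea"] = df["BuildingArea"].replace(["missing", "inf"], np.nan)
df["BuildingArea"] = df["BuildingArea"].astype(float)

print(df["BuildingArea"].dtype)
print(df["BuildingArea"].isna().sum())
```

Depending on the pandas version, the replace call may emit a deprecation warning about downcasting; the explicit astype(float) afterwards gives the same result either way.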

df.info() - rerunning this after data cleaning may result in cleaned columns type changing to, say, float.

Check the non-null count of each column. A column with fewer non-null entries than the number of rows has missing values (empty cells).
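One way to run this check, sketched on a hypothetical data frame: df.count() gives the non-null entries per column, and subtracting it from the number of rows gives the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two empty cells in one column.
df = pd.DataFrame(
    {"Rooms": [2, 3, 4, 3], "BuildingArea": [120.0, np.nan, 95.5, np.nan]}
)

non_null = df.count()          # non-null entries in each column
missing = len(df) - non_null   # equivalent to df.isna().sum()

print(missing)
```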

Alternative approach - clean while loading:

using na_values to tell pandas which values it should consider as NaN

data_new = pd.read_csv('/content/drive/MyDrive/Python Course/Melbourne_Housing.csv',na_values=['missing','inf'])

  • on load, the line above automatically converts all 'missing' and 'inf' entries to NaN, so running: data_new['BuildingArea'].dtype
  • gives dtype('float64'), because the column now holds only floats (NaN is itself a float, so it does not change the column's type)
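The clean-while-loading approach can be tried without the course file. This sketch reads a small in-memory CSV instead of the Google Drive path above; the CSV contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Melbourne_Housing.csv.
csv_text = "Rooms,BuildingArea\n2,120.0\n3,missing\n4,95.5\n3,inf\n"

# na_values lists the strings that read_csv should treat as NaN.
data_new = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "inf"])

print(data_new["BuildingArea"].dtype)
print(data_new["BuildingArea"].isna().sum())
```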

Review note

data['BuildingArea'].unique()

  • above line run before cleaning gives unique values in column as a numpy array
  • so can inspect to find out which strings to remove.

setup steps

  • python3 -m venv .venv - in bash - and on Windows
  • source .venv/bin/activate - in bash
  • source .venv/Scripts/activate - on Windows - on VSCode Windows bash
  • /workspace/machine-learning-classification/.venv/bin/python -m pip install --upgrade pip - in GitPod
  • python3 -m pip install --upgrade pip - on Windows

.venv/Scripts/python.exe -m pip install --upgrade pip - in .venv

  • pip install --upgrade pip
  • pip install jupyter notebook
  • pip install matplotlib
  • pip install pandas
  • pip install seaborn
  • pip install numpy
  • pip install scipy
  • pip install statsmodels
  • pip install -U scikit-learn
  • pip install ipykernel
  • pip install nb-black

  • Ctrl Shift P > Create New Jupyter Notebook
  • Save and name notebook
  • Paste in necessary code

  • Ctrl Shift P > Python: Select Interpreter
  • use Python version in ./.venv/bin/python

pip freeze > requirements.txt

pip install -r requirements.txt

Add required files

pima-indians-diabetes.csv

Extensions

Extension: Excel Viewer - for viewing csv files in VSCode

Debug

jupyter cannot find modules

prelim

As above, use Python: Select Interpreter and choose 3.10.9 (.venv)

ipykernel bug

After running pip install ipykernel, running LinearRegression_HandsOn-1.ipynb shows a message saying it is necessary to install ipykernel. Click OK to install ipykernel, then rerun LinearRegression_HandsOn-1.ipynb.

pandas bug

After running pip install pandas, the notebook still reports that pandas is not found.

Fix for previous 2 bugs

create new jupyter notebook using Ctrl Shift P Create New Jupyter Notebook
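When a notebook cannot find an installed module, a common cause is that the kernel runs a different interpreter from the one pip installed into. The following sketch, run in a notebook cell, shows which interpreter is active and which packages it can import. The helper names in_virtualenv and missing_modules are hypothetical and not part of this repository.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
print("not importable here:", missing_modules(["pandas", "ipykernel"]))
```

If the interpreter path does not point inside .venv, selecting the .venv interpreter for the notebook is the first thing to try.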

Files

summary

  • summary-income.md
    • high level summary of steps in income.ipynb notebook

References

previous repositories

jupyter-test jupyter-repo-2 jupyter-3

References Part2 / (MyGreatLearning, Colab, modules)

MyGreatLearning

pre scikit-learn

scikit-learn

Colab

modules

matplotlib
matplotlib figure dimensions
scipy

References Part3 / (StackOverflow, Git, Tutorials and Repositories)

StackOverflow

https://stackoverflow.com/questions/46419607/how-to-automatically-install-required-packages-from-a-python-script-as-necessary

Git

git

gitignore

Gitpod

Git in VSCode

Tutorials and Repositories

References Part4 / (environments, Packages, Statistics, python, ML, Stats for ML)

environments

local

Windows Anaconda

  • conda create --name .cenv
  • y
  • conda activate .cenv

python3

python3 was not installed, so the Windows Store opened; installed Python 3.10 from there

conda

virtual environment

python environment

python3 -m venv .venv command was slow at first but self-resolved

Packages

NumPy

Pandas

matplotlib

subplots

colors

other matplotlib

boxplot
histplot

error

scipy

scipy.stats

statsmodels

scikit-learn

Documentation

ipykernel

colors for jupyter notebook charts

.venv error [Resolved]

0 Axes error [Resolved]

save Pandas dataframe/series data to figure then to file

Statistics

pandas print statement

python

main.py (files 1 to 4) and script.sh in CoderSales/machine-learning-classification (repository reference below)

storing variables

naming arbitrary number of variables

append

pass multiple variables into string

multiline string python

How do you add value to a key in Python?

pass variable into string variable

turn off pandas index output

concatenate

String into variable

.update() a dictionary

print separate with no spaces

function

ML

Linear Regression

Logistic Regression

Statistics for ML (Logistic Regression)

F-beta score: sklearn documentation

F score

References Part5 / (other, VSCODE workflow window views, HTML, CSS, IMG)

VSCODE workflow window views

  • Keyboard Shortcuts > workbench.action.duplicateWorkspaceInNewWindow: Ctrl Shift Alt N (modified from the shortcut suggested on the site)

font

HTML

CSS

nb-black / jupyter notebook formatting

Images

IMG

SVG

Repositories

References Part 6 / (bash, shell scripting)

subprocess file calls

venv location

shell