Data Science Jupyter notebook in Docker

If you want to dive in deeper, these are great places to start:

Instructions

Make sure you update the path of the volume where your data will be saved in the docker-compose.yml file.
Run: docker-compose up
From the logs, copy the localhost URL that includes the token and paste it into your browser.
Stop: docker-compose down

Workshop

Data exploration

Read the data

import pandas as pd
# top50.csv is read as ISO-8859-1; the first column is used as the index
filename = 'top50.csv'
df = pd.read_csv(filename, encoding='ISO-8859-1', index_col=0)

Preview the data
What is the shape of the data?
Rename the columns

df.rename(columns={
    'Track.Name': 'track_name', 'Artist.Name': 'artist_name',
    'Beats.Per.Minute': 'beats_per_minute', 'Loudness..dB..': 'Loudness(dB)',
    'Valence.': 'Valence', 'Length.': 'Length',
    'Acousticness..': 'Acousticness', 'Speechiness.': 'Speechiness',
}, inplace=True)
df.head()
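The preview and shape prompts above can be answered with `head()` and `shape`. The sketch below runs on a small made-up frame (the rows are placeholders, not values from top50.csv) so that it works on its own; in the notebook you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; these rows are placeholders.
sample = pd.DataFrame({
    "track_name": ["song a", "song b", "song c"],
    "artist_name": ["artist x", "artist y", "artist x"],
    "Popularity": [80, 75, 90],
})

preview = sample.head(2)        # first two rows only
n_rows, n_cols = sample.shape   # shape is a (rows, columns) tuple

print(preview)
print(f"{n_rows} rows, {n_cols} columns")
```

`head()` without an argument returns the first five rows, which is usually enough for a first look.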

Check for null values

df.isnull().sum()

Fill the null values

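# Alternative: replace every missing value with 0
# df.fillna(0)
# Fill missing numeric values with the column mean; numeric_only skips text columns
df.fillna(df.mean(numeric_only=True), inplace=True)
df.head()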
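To see what the mean fill does, here is a minimal sketch on a made-up frame with one missing value. The frame and its values are placeholders, not rows from top50.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text column, one numeric column with a gap.
toy = pd.DataFrame({
    "artist_name": ["artist x", "artist y", "artist z"],
    "Energy": [50.0, np.nan, 70.0],
})

missing_per_column = toy.isnull().sum()   # count of nulls in each column

# The mean is computed over numeric columns only; text columns are left untouched.
filled = toy.fillna(toy.mean(numeric_only=True))

print(missing_per_column)
print(filled)
```

The gap in `Energy` is replaced by the mean of the remaining values, while `artist_name` is unchanged. Whether the mean is a sensible fill depends on the column; filling with 0 can distort features such as loudness or tempo.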

Get a list of all genres

genre_list = df['Genre'].values.tolist()

List the frequency of all artists

popular_artist=df.groupby('artist_name').size()
print(popular_artist)

List all artists

artist_list=df['artist_name'].values.tolist()
print(artist_list)
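The list above repeats an artist once per track. If a sorted frequency table or a de-duplicated list is more useful, `value_counts()` and `unique()` are one option. This is a sketch on placeholder names, not the workshop's own solution.

```python
import pandas as pd

# Hypothetical artist column; the names are placeholders.
toy = pd.DataFrame({"artist_name": ["artist x", "artist y", "artist x"]})

# Frequency per artist, sorted with the most frequent first.
artist_counts = toy["artist_name"].value_counts()

# Each artist once, in alphabetical order.
unique_artists = sorted(toy["artist_name"].unique().tolist())

print(artist_counts)
print(unique_artists)
```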

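Describe the data
Make a nice plot

import matplotlib.pyplot as plt

# Keep only the numeric columns; text columns cannot be plotted as histograms
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
columns = list(newdf.columns)

# Draw one histogram per numeric column
for col in columns:
    plt.ylabel('frequency')
    plt.xlabel(col)
    plt.hist(newdf[col], bins=20)
    plt.show()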
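The "describe the data" prompt has no code in the walkthrough. One way to approach it is `DataFrame.describe()`, which reports count, mean, standard deviation, min, max, and quartiles for each numeric column. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical numeric columns; the values are placeholders.
toy = pd.DataFrame({"Energy": [50, 60, 70], "Danceability": [40, 80, 60]})

summary = toy.describe()   # one column of statistics per numeric column
print(summary)
```

Reading the summary next to the histograms helps spot skewed columns and outliers before training a model.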

Training a model

Extract features and target:

x=df.loc[:,['Energy','Danceability','Length','Loudness(dB)','Acousticness']].values
y=df.loc[:,'Popularity'].values

Split into training and test data

from sklearn.model_selection import train_test_split
# Create the training and test datasets; 30% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
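The split is random, so the rows that land in each set change from run to run, and so do the error figures later on. If you want repeatable results, one option is to pass `random_state`. The sketch below uses placeholder arrays in place of `x` and `y`, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for x and y; the values are not from top50.csv.
x_demo = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y_demo = np.arange(10)

# random_state fixes the shuffle, so repeated runs give the same split.
X_train, X_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```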

Make a linear regressor and get the model values

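from sklearn.linear_model import LinearRegression
# Fit an ordinary least squares model on the training data
regressor = LinearRegression()
regressor.fit(X_train, y_train)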
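The "model values" of a fitted `LinearRegression` are its `coef_` (one weight per feature) and `intercept_` attributes. As a minimal sketch, the example below fits a model on made-up data generated from a known line, so the recovered values can be checked by eye; in the notebook you would read the same attributes from `regressor`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2*x1 + 3*x2 + 1 exactly (no noise).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)     # one weight per feature column
print("intercept:", model.intercept_)   # predicted value when all features are 0
```

On real data the coefficients will not be this clean, and their size depends on the scale of each feature.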

Analyse the results

# Display the actual and predicted values side by side
y_pred = regressor.predict(X_test)
df_output = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df_output)

Are these good or bad?

Quantify this

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
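Raw error values are hard to judge in isolation. Two further figures that may help, offered here as a suggestion rather than part of the original workshop: the root mean squared error, which is in the same units as the target, and R², which compares the model against always predicting the mean. The arrays below are placeholders, not model output.

```python
import numpy as np
from sklearn import metrics

# Placeholder actual and predicted values, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                      # back in the units of the target
r2 = metrics.r2_score(y_true, y_hat)     # 1.0 is a perfect fit; 0.0 matches predicting the mean

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2)
```

An R² near or below 0 on the test set suggests the chosen features explain little of the variation in popularity.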

