arif9799/README.md

Hi there, I am Arif!

LinkedIn   Medium   Facebook   Discord   YouTube   Pinterest   LeetCode   Download Resume  




Snapshot... Me, Myself!


Description

I'm Arif, a Data Scientist with a Master's in Data Science from Northeastern University. My expertise lies in transforming complex datasets into actionable, meaningful insights. I craft end-to-end automated solutions, translate business inquiries into fully fledged ML systems, and excel at building statistical & predictive models while simplifying complex concepts and presenting results to both technical and non-technical stakeholders via creative mediums, including animations!

I've tackled impactful projects at PUMA North America, where I automated outlier detection in retail sales data and built a sales and demand forecasting system to optimize raw-supply inventory. I've also served as a Graduate TA for CS7150 Deep Learning and DS4400 Machine Learning and Data Mining, where I designed homeworks & rubrics, graded coursework and held office hours to simplify concepts.

My technical strengths include Time Series Analysis, Machine Learning, Statistics, SQL and software integration. I'm proficient in Python, R and C/C++, and in libraries like Pandas, Scikit-learn, NumPy, TensorFlow, PyTorch, Keras and bs4. I'm a highly motivated, detail-oriented individual with a strong work ethic, always keen to learn something new.





In the Pages of My Journey!


Description

Surfing the internet only to stumble upon this portfolio, a coincidence? Nah! You're in just the right place. I'm Arif, a Data Scientist, Data Engineer & Python Developer with a passion for automation & a zeal to craft solutions that require minimal or no human intervention. I love fiddling with datasets, exploring relations, extracting significant insights, retrieving never-heard-before stories, and scrutinizing data not just until every WHY has been answered but until I've uprooted causes of problems no one knew existed in the first place, then leveraging those findings to train apt ML algorithms that tackle challenges as well as, or even better than, humans.📊🔍

Fueled by passion for this domain, I'm preparing myself for a long, progressive career with ample opportunities to keep growing in the field. Ever since my first encounter with the stunning capabilities of AI, the one that set the course of my life and induced me to pursue Data Science, I've been awe-struck every single time I comprehend the working mechanisms behind this field's feats: creativity fused with math, where numbers tend to be more reliable in decision-making than human instincts.🤖📈

I hold a Master's degree in Data Science from Northeastern University - Khoury College of Computer Sciences and a Bachelor's degree in Computer Engineering from Gujarat Technological University - Pacific School of Engineering. I am an experienced Data Scientist who can build fully fledged end-to-end Data Science and Machine Learning applications, with strong practical and theoretical experience in developing Supervised and Unsupervised Machine Learning, Deep Learning and AI models.🎓💻

I have worked at US organizations such as PUMA North America and Khoury College of Computer Sciences as a Data Scientist and Graduate Teaching Assistant, generating valuable information, insights and conclusions from data and sharing field knowledge with peers and fellow students at Northeastern. This portfolio demonstrates the wide range of skills I bring to machine learning problems and stands as a record of my contributions to the field of Data Science.📚💡



Skills!


Description


The following tools and technologies, along with the essentials, define my capabilities and go-to stack for solving Data Science problems:

  • Code/ Cloud Mastery: Azure, Python, SQL, MySQL, R, YAML, C, C++, Java
  • Frameworks/ Packages/ Libraries: NumPy, Pandas, sklearn, TensorFlow, Keras, PyTorch, Darts, SciPy, Plotly, Matplotlib, OpenCV
  • Familiar Libraries/ Softwares: Docker, Kubernetes, Power BI, OpenGL, mlflow, Streamlit, Selenium, Microsoft SQL Server, Flask
  • IDEs/ Version Control: GitHub, Colab, Jupyter Notebook, Conda, VSCode, Notepad++, PyCharm, RStudio
  • I contribute to: LeetCode, HackerRank, Kaggle, Quora, Reddit, Stack Exchange, Stack Overflow
  • Proficient in Tools: Linux, macOS, Windows, Microsoft Excel, Microsoft Word, Microsoft PowerPoint, Adobe, Markdown, LibreOffice, Overleaf, LaTeX





Alma Mater


Description





Career Kaleidoscope!


PUMA North America, Data Scientist

July'22- December'22


Description


Successfully delivered and deployed two fully fledged, production-grade ML projects that are still in effect today!


SALES AND DEMAND FORECASTING

Predictive models used: AR, MA, ARIMA, SARIMA, Exponential Smoothing, Random Forest Regression, XGBoost, fbProphet, etc.
Cloud and libraries used: Azure ML Studio, Azure Data Factory, Azure Databricks, pandas, numpy, scikit-learn, PyTorch, TensorFlow, etc.

  • Situation: Operational hurdles in fulfilling Customer demands due to inadequate forecast of raw materials, impacting revenue streams
  • Task: Develop Time Series Forecasting System to forecast future Demands, enabling proactive raw material procurement strategies
  • Action: Performed Data Cleaning, standardization, feature engineering, imputations & deployed Univariate/Multivariate Models on Cloud
  • Results: Achieved 34% reduction in RMSE & enhanced accuracy to 80%, facilitating decision-making efficiency & increased revenue
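As a hedged illustration of the univariate models listed above, here is a minimal autoregressive fit and multi-step forecast in plain NumPy. The series, lag order and coefficients are synthetic stand-ins for illustration only, not the PUMA data or pipeline.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model by least squares on lagged values
    (a toy stand-in for the ARIMA/SARIMA family)."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    y = series[p:]
    X = np.column_stack([np.ones(len(y)), X])        # intercept term
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def forecast(series, coef, steps):
    """Roll the fitted AR model forward for multi-step forecasts."""
    p = len(coef) - 1
    history = list(series[-p:])
    preds = []
    for _ in range(steps):
        nxt = coef[0] + np.dot(coef[1:], history[-p:])
        preds.append(nxt)
        history.append(nxt)
    return np.array(preds)

# Synthetic "demand" series: linear trend plus noise
rng = np.random.default_rng(0)
t = np.arange(100)
demand = 50 + 0.5 * t + rng.normal(0, 2, 100)

coef = fit_ar(demand, p=3)
print(forecast(demand, coef, steps=4))
```

In the real system the lag order, seasonality and exogenous features would come from model selection on the actual sales data.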


ANOMALY DETECTION

Predictive models used: AR, MA, ARIMA, SARIMA, Exponential Smoothing, Random Forest Regression, XGBoost, etc.
Cloud and libraries used: pandas, numpy, scikit-learn, PyTorch, openpyxl, ADTK, Seaborn, Matplotlib, sktime, darts, statsmodels, etc.

  • Situation: Data transmission issues from local store registers led to significant gaps in analytical reports, requiring intervention
  • Task: Develop a Python application using Machine Learning and Time Series methods to automate anomaly detection in retail sales data
  • Action: Implemented automated data retrieval & trained Univariate time series models for various rolling stats, streamlining detection
  • Results: Achieved a 90% reduction in man-hours at 60% accuracy, deployed as a packaged Python application for outlier detection and rectification.
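One of the rolling statistics such a detector can monitor is a trailing-window z-score. The sketch below is illustrative only; the window size, threshold and synthetic data are assumptions, not the production configuration.

```python
import numpy as np

def rolling_zscore_anomalies(x, window=7, threshold=3.0):
    """Flag points deviating from the trailing-window mean by more than
    `threshold` standard deviations (window/threshold are illustrative)."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        w = x[i - window:i]
        mu, sigma = w.mean(), w.std()
        if sigma > 0 and abs(x[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

rng = np.random.default_rng(1)
sales = rng.normal(100, 5, 60)
sales[45] = 0.0          # a register drop-out appears as a sudden collapse to zero
print(np.flatnonzero(rolling_zscore_anomalies(sales)))
```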


Khoury College of Computer Sciences, Graduate Teaching Assistant

Jan'22- December'22


Description



  • CS 7150: Deep Learning
    • The course curriculum needed enhancement to emphasize practical applications of Neural Networks including Diffusion and LLMs
    • Helped restructure the curriculum, create an interactive learning environment & build logistics & infrastructure
    • Designed and implemented rubrics & advanced homeworks on topics like Transformers and Diffusion Models from scratch
    • Set up discussion panels on latest AI research, invoking live discussions while fostering 90% increase in student participation
    • Restructured curriculum resulted in students not only grasping theoretical concepts but also understanding real-world relevance
  • DS 4400: ML & Data Mining
    • The course aimed to provide students with a comprehensive understanding of Data Science, ML and Advanced Predictive Analytics
    • My role was integral to bridging theoretical knowledge with hands-on experience in building ML Models from scratch in Python
    • Simplified complex academic theories into simple laymen terms, translating them into tangible real-world applications
    • Guided students through experiential learning projects to provide a practical bridge between theory and application
    • Contributed to comprehensive course coverage, ensuring a strong foundation in both theory and practice
    • Students gained a robust understanding of the subject, applying theoretical knowledge to real-world scenarios successfully




Project Portfolio: Crafting Solutions

The following section showcases my skills as applied in personal and curriculum projects. I invite you to explore the GitHub profile and the projects' README files for further information, as this is just a short summary of the vast array of possibilities where my skills can be applied, in collaboration with SMEs.



Description


You speak, We'll find it!

Used an API to access the flickr30k dataset and extract annotation entities (bounding boxes), phrases & images, assembling a pipeline to extract, transform & collate data from multiple sources. Pre-calculated high-level general-representation embeddings of images & textual content using pre-trained Vision Transformers & BERT respectively. Built & trained a baseline Transformer encoder-decoder model on the concatenated image and text embeddings to predict bounding boxes at 75% IoU with 58% accuracy, reliably spotting objects in images as described by the accompanying text. Also developed and experimented with lightweight architectures such as textual encoder & decoder, vision encoder & decoder and decoder-only networks, with performance matching or outperforming the baseline model.
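The 75% IoU figure quoted above refers to Intersection-over-Union between predicted and ground-truth boxes; a minimal implementation of that metric (with made-up example boxes) looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two (x1, y1, x2, y2) boxes:
    overlap area divided by the area of the union."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-overlapping boxes -> 1/3
```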



Description


The Canvas Conundrum: Imposing Style of an image onto Contents of another

An endeavor to impose an artistic style image onto the contents of another, employing transfer learning with the pre-trained CNN model VGG-19. Normalized the images, built a content-style loss function & convolved through the CNN (with frozen weights) while back-propagating the summed loss to a noise image. Also performed hyper-parameter tuning to find optimal values of the learning rate, 'ɑ' & 'β' (ɑ and β determine the proportions of content & style to be injected), reaching 21% MSE loss.
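The content-style loss mentioned above has the form L = ɑ·L_content + β·L_style, where style is compared via Gram matrices of feature maps. This is a toy NumPy sketch on made-up (channels, pixels) arrays, not the VGG-19 activations or weighting used in the project:

```python
import numpy as np

def gram(features):
    """Gram matrix of a (channels, height*width) feature map; style is
    captured by these channel-wise correlations."""
    return features @ features.T

def nst_loss(content_f, style_f, generated_f, alpha=1.0, beta=1e-3):
    """Weighted sum L = alpha * L_content + beta * L_style. The alpha/beta
    values here are arbitrary placeholders, not the tuned ones."""
    l_content = np.mean((generated_f - content_f) ** 2)
    l_style = np.mean((gram(generated_f) - gram(style_f)) ** 2)
    return alpha * l_content + beta * l_style

content = np.ones((2, 4))
style = np.zeros((2, 4))
generated = 0.5 * np.ones((2, 4))
print(nst_loss(content, style, generated))
```

In the full method this loss is back-propagated into the generated (noise) image while the CNN weights stay frozen.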



Description


OpinioCraft: Unleashing Sentimental Insights through Unsupervised ML

An unsupervised approach to mining opinions, thoughts and emotions from the mathematical notion of words, determining the sentiment of the reviews being processed to produce recommendations. The principal focus is to retrieve a user's search query (product & category) and recommend the top-n products from that category alone. The underlying mechanism, in simplest terms, is to classify review sentiment as positive or negative, then cluster unique items and pick the top-k products by higher average connotation score.



Description


RateVue: Decoding IMDb – A Feature Alchemy

Started by importing a primary dataset of 45k+ records, merging it with a secondary dataset to handle missing values in certain variables, then validating custom procedures for data wrangling, typecasting, pivoting erratic variables into a sparse matrix and much more. Conducted univariate exploratory data analysis to explore relations among the dependent and independent variables. Trained simple models, namely Logistic Regression & kNN, which outperformed complex ones such as Decision Tree & Random Forest.



Description


InsuLens: Focusing Clarity in Diabetic Classification

Cleaned and preprocessed anthropometric datasets with a whopping 1.8 million observations collected from 9 different states in India. Performed hyperparameter tuning with Grid Search Cross-Validation to derive optimal parameters for training a Multi-Layer Perceptron classifier to identify diabetics. Upsampled the minority class of the imbalanced dataset using the SMOTE technique, which drastically increased the accuracy of predicting the diabetic class from 13% to an impressive 71.4%.
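SMOTE's core idea is to synthesize minority-class points by interpolating between a real point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea on made-up 2-D points, not the imbalanced-learn implementation used in practice:

```python
import numpy as np

def smote_sample(minority, n_new, k=3, rng=None):
    """Generate synthetic minority points: pick a point, pick one of its
    k nearest minority neighbours, and interpolate between them."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        nb = minority[rng.choice(neighbours)]
        lam = rng.random()
        synthetic.append(x + lam * (nb - x))      # point on the segment x -> nb
    return np.array(synthetic)

# Toy minority class: four points at the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sample(minority, n_new=3).shape)
```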



Description


DollaLlama: Wrangling Sales Data with Quirky Precision

Coalesced 180k+ records of electronic-appliance sales into one file, performed data wrangling & mining and feature-engineered variables. Envisioned strategic analyses by month, quantities, revenue generated & best-sellers to drive product decisions, analyzed consumer purchasing patterns and extrapolated items to recommend based on what is frequently bought together.
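The "frequently bought together" signal can be sketched by counting item pairs that co-occur within the same order. The order data below is invented for illustration and is not the project's dataset:

```python
from collections import Counter
from itertools import combinations

def frequently_bought_together(orders, top=3):
    """Count item pairs that co-occur in the same order and return
    the `top` most common pairs."""
    pairs = Counter()
    for order in orders:
        # sort + de-duplicate so ("A", "B") and ("B", "A") count together
        pairs.update(combinations(sorted(set(order)), 2))
    return pairs.most_common(top)

orders = [
    ["iPhone", "Lightning Cable"],
    ["iPhone", "Lightning Cable", "Wired Headphones"],
    ["Google Phone", "USB-C Cable"],
    ["iPhone", "Wired Headphones"],
]
print(frequently_bought_together(orders))
```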



Description


Tomorrow's Time, Today's Numbers: Life Expectancy in a Snap

Forecasted life expectancy by constructing a linear regression model on the independent attributes of the primary dataset, with subsequent feature engineering of 5 candidate predictors & selection of 3 independent variables as predictors for the regression model. After exhausting every combination of predictors against the response variable 'Life Expectancy', the final model achieved an RMSE of 0.0095.
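The fit-then-score loop behind a model like this can be sketched with ordinary least squares and the same RMSE metric; the three predictors and the response here are synthetic placeholders, not the life-expectancy data:

```python
import numpy as np

def fit_and_rmse(X, y):
    """Ordinary least squares (with intercept) and training RMSE."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    rmse = np.sqrt(np.mean((Xb @ beta - y) ** 2))
    return beta, rmse

# Toy stand-ins for the three selected predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = 70 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(0, 0.1, 50)
beta, rmse = fit_and_rmse(X, y)
print(round(rmse, 3))
```

In the project, this fit was repeated over every candidate predictor combination to pick the best subset.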



Description


Graphonomics: Crafting a Visual representation of Economic Growth

Started by extracting demographic and economic data from the 'World Development Indicators' datasets via the WDI package in R, then analyzed the data by plotting time series of the GDP of selected countries over the last 6 decades and constructing a mini-poster for contrast. Finally, drew inferences from the various GDP peaks & variable correlations and formed presumptions about 'The Great Recession'.



Medium Blogs: Turning Geek-Speak into Shakespeare, One Article at a Time.



Description


In this article, I describe in detail how we built an opinion-based unsupervised recommendation engine on the Amazon reviews dataset. Using word embeddings, clustering techniques and a custom calculation of connotation scores, we determine the top-n products to recommend to a particular user from their search queries alone. The workflow involves data pre-processing, text tokenization, model training with Word2Vec, KMeans clustering, class determination, sentiment scoring and generating personalized recommendations based on user queries and product categories. The project is generic and applicable to various review datasets.
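The final recommendation step of that workflow, averaging per-product connotation scores within the queried category and returning the top-n, can be sketched as follows. The product names, categories and scores are hypothetical; in the article the scores come from the Word2Vec/KMeans sentiment stage:

```python
from collections import defaultdict

def top_n_products(reviews, category, n=2):
    """Average each product's connotation scores within `category` and
    return the n best products. `reviews` is (product, category, score)."""
    scores = defaultdict(list)
    for product, cat, score in reviews:
        if cat == category:
            scores[product].append(score)
    ranked = sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]),
                    reverse=True)
    return ranked[:n]

reviews = [
    ("EchoBuds", "headphones", 0.9), ("EchoBuds", "headphones", 0.7),
    ("BassKing", "headphones", 0.4), ("AirPro", "headphones", 0.8),
    ("ZoomCam", "cameras", 0.95),
]
print(top_n_products(reviews, "headphones"))
```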





Description


In this article, I delve into time series forecasting with a focus on calculating prediction intervals, emphasizing how their width varies for time series data compared to non-time-series data. I shed light on concepts such as confidence intervals, confidence levels, prediction intervals, normal distributions and z-values, exploring statistical basics including population vs. sample, the use of z-values to standardize normal distributions and the Central Limit Theorem. In the final section, I explain prediction intervals for both single-step and multi-step forecasting, considering how the standard error grows with the forecasting horizon. The article concludes with a practical example of predicting stock prices and calculating the corresponding prediction intervals.
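The widening of multi-step intervals can be sketched with the common random-walk-style assumption that the standard error grows like σ·√h, giving an interval ŷ ± z·σ·√h. The forecast value and σ below are made-up numbers, and this growth rule is one modeling assumption among those the article discusses:

```python
import numpy as np

def prediction_interval(forecast, sigma, horizon, z=1.96):
    """95% prediction interval that widens with the forecasting horizon:
    interval = forecast +/- z * sigma * sqrt(horizon)."""
    half_width = z * sigma * np.sqrt(horizon)
    return forecast - half_width, forecast + half_width

# Toy example: sigma estimated from one-step residuals, 1- to 4-step intervals
for h in range(1, 5):
    lo, hi = prediction_interval(forecast=150.0, sigma=2.0, horizon=h)
    print(h, round(lo, 2), round(hi, 2))
```

Note the 4-step interval is exactly twice as wide as the 1-step one, since √4 = 2.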





Description


In this article, I explore the evolution of deep learning architectures, focusing on the shortcomings of Recurrent Neural Networks (RNNs) that led to the rise of Transformer networks. I cover concepts such as positional encoding, the relational database's Key-Query-Value analogy that underlies transformer self-attention, and the transition from RNNs to Transformers. The article emphasizes the importance of attention mechanisms, detailing self-attention and its role in learning the structure of input sequences, and introduces multi-head attention in Transformers, demonstrating how it lets the model learn multiple aspects simultaneously. I conclude by hinting at the cross-attention mechanism in the decoder, leaving room for a detailed, easy explanation in upcoming articles of this Deep Learning series.
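The Key-Query-Value mechanism the article explains reduces to scaled dot-product attention, softmax(QKᵀ/√d)·V. This is a minimal single-head NumPy sketch with random toy weights (no masking, no multi-head split):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention:
    project X to queries/keys/values, then mix values by
    softmax(Q K^T / sqrt(d))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                   # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))
```

Multi-head attention repeats this with several independent projection triples and concatenates the results.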







Contribution Statistics!


GitHub Repository Top Languages


LeetCode Statistics

Competencies: Because Juggling Wasn't on Resume


Description


  • Leadership Skills
  • Communication Skills
  • Team Work
  • Self Starter
  • Problem-solving Skills
  • Keen and Curious
  • Time Management




Pinned

  1. Deep-Learning-Computer-Vision---Visual-Grounding (Public)

    Jupyter Notebook

  2. Neural-Style-Transfer---Deep-Learning (Public)

    Imposition of Artistic Style Images onto Content Images

    Jupyter Notebook

  3. Sentimental-Recommendation-System (Public)

    Opinions-based Recommendation Engine.

    Jupyter Notebook

  4. Feature-Analysis (Public)

    Analyzing variables, modeling and data cleaning to determine factors that contribute to the success rate of a movie.

    Jupyter Notebook

  5. Diabetic-Classification (Public)

    Classifying a patient as Diabetic or Non-Diabetic based on Anthropometric Data.

    Jupyter Notebook

  6. Sales-Analysis (Public)

    Exploratory Data Analysis of the Sales of products from an Online Shop

    Jupyter Notebook