Applying fraud analysis in the NY Property Tax dataset with 1048575 observations and 30 variables

ChengChen-Steven/Anomaly-Detection-of-NY-Property-Data

Fraud-Detection-of-NY-Property-Data

====

File name: NY_property_data Source
Data Provided by: Department of Finance (DOF)
Dataset Owner: NYC OpenData
Category: Housing & Development
Date Created: September 2, 2011
Metadata Last Updated: September 5, 2014
Data volume: 1,048,575 records
Fields: 30 fields (16 continuous, 12 categorical, 2 text)
Variables: RECORD, BBLE, BLOCK, LOT, EASEMENT, OWNER, BLDGCL, TAXCLASS, LTFRONT, LTDEPTH, STORIES, FULLVAL, AVLAND, AVTOT, EXLAND, EXTOT, EXCD1, STADDR, ZIP, EXMPTCL, BLDFRONT, BLDDEPTH, AVLAND2, AVTOT2, EXLAND2, EXTOT2, EXCD2, PERIOD, YEAR, VALTYPE.

Purpose of the project

  • Build metrics to detect potentially fraudulent records

Outline of Approach

Step1: Data Cleaning
  • Produce a Data Quality Report (DQR)
  • Create 50 additional, more informative variables derived from the original variables
  • Partition the data based on 7 key metrics
  • Handle missing values and z-scale the variables
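The last bullet of Step1 can be sketched as follows. This is a minimal illustration, not the repository's code: the source does not say how missing values are filled, so column-median imputation is an assumption, and the function name `impute_and_zscale` is hypothetical.

```python
import numpy as np

def impute_and_zscale(X):
    """Fill NaNs with the column median (an assumed strategy), then z-scale each column."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)          # per-column median, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]              # replace each NaN with its column median
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                        # constant columns become 0 instead of dividing by zero
    return (X - mean) / std

# Toy data with one missing value in the second column
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
Z = impute_and_zscale(X)
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the PCA in the next step merely because of its units.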
Step2: PCA (Principal Component Analysis)
  • Run PCA, visualize the results, and keep 13 PCs
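A rough sketch of this step, assuming the input is the z-scaled variable matrix from Step1. It uses NumPy's SVD rather than whatever PCA routine the repository calls; the function name `pca_scores` and the random toy data are illustrative only. The returned explained-variance ratios are the kind of quantity one would plot when deciding how many PCs to keep.

```python
import numpy as np

def pca_scores(Z, n_components):
    """Project data onto its top principal components via SVD."""
    Zc = Z - Z.mean(axis=0)                            # center each column
    U, S, Vt = np.linalg.svd(Zc, full_matrices=False)  # rows of Vt are the PC directions
    explained = S**2 / np.sum(S**2)                    # fraction of variance per PC
    components = Vt[:n_components]
    scores = Zc @ components.T                         # coordinates of each record on the PCs
    return scores, explained[:n_components]

# Toy stand-in for the z-scaled variables (200 records, 20 variables)
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 20))
scores, explained = pca_scores(Z, n_components=13)
```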
Step3: Set Fraud Score:
1. Euclidean Distance
  • Based on projection of original features on the 13 PCs' directions
  • Calculate each record's Euclidean distance to the origin as the fraud score
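The distance-based score can be illustrated like this. The sketch assumes `pc_scores` holds one row per record and one column per retained PC; whether the PCs are rescaled before the distance is taken is not stated in the source, so the raw projections are used here. The function name is hypothetical.

```python
import numpy as np

def euclidean_fraud_score(pc_scores):
    """Distance of each record from the origin in PC space."""
    return np.sqrt(np.sum(np.square(pc_scores), axis=1))

# Toy projections onto 2 PCs (the project uses 13)
pc_scores = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
scores = euclidean_fraud_score(pc_scores)   # [5.0, 0.0, 1.0]
ranking = np.argsort(-scores)               # record indices, most anomalous first
```

Records far from the origin are unusual along at least one principal direction, which is why a larger distance is read as a higher fraud score.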
2. Auto-encoder (Neural Network Algorithm)
  • Use the h2o package to implement the auto-encoder
  • The PCA output serves as the input
  • Use two hidden layers, each with five units
  • Use each record's reconstruction MSE as the fraud score
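The scoring step for the auto-encoder can be sketched independently of h2o. The snippet below does not train a network; it assumes `X_hat` is the reconstruction some auto-encoder produced for the input `X`, and only shows how a per-record MSE would be computed. The function name `reconstruction_mse` is hypothetical.

```python
import numpy as np

def reconstruction_mse(X, X_hat):
    """Mean squared reconstruction error of each record (one value per row)."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    return np.mean((X - X_hat) ** 2, axis=1)

# Toy input and an assumed reconstruction of it
X = np.array([[1.0, 2.0], [0.0, 0.0]])
X_hat = np.array([[1.0, 0.0], [0.0, 0.0]])
mse = reconstruction_mse(X, X_hat)   # [2.0, 0.0]
```

A record that the network reconstructs poorly does not resemble the bulk of the data it was trained on, so a larger MSE is treated as a higher fraud score.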
3. Comparison
  • 76% overlap between the 10,000 highest-scoring records under the Euclidean distance score and under the auto-encoder score
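One way such an overlap figure could be computed is shown below. This is a sketch under the assumption that both scores are arrays aligned by record; the helper name `top_k_overlap` and the toy scores are illustrative, and the project's value of k would be 10000.

```python
import numpy as np

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of records that appear in the top k of both score arrays."""
    top_a = set(np.argsort(-np.asarray(scores_a))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(scores_b))[:k].tolist())
    return len(top_a & top_b) / k

# Toy scores for five records from two scoring methods
a = np.array([9.0, 8.0, 1.0, 7.0, 2.0])
b = np.array([9.0, 1.0, 8.0, 7.0, 2.0])
overlap = top_k_overlap(a, b, k=3)   # records 0 and 3 are shared, so 2/3
```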
4. Discover insights
  • Inspect the 30 records with the highest fraud scores and identify fraud patterns among them

==== Keep updating...
