Skip to content

chrisconlon/kiltsnielsen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NielsenReader

NielsenReader defines the classes RetailReader and PanelReader to facilitate easy processing of the Kilts Center's Nielsen IQ Data.

  • RetailReader processes Retail Scanner Data
  • PanelReader processes Consumer Panel Data

The main advantages of this package are:

  1. Speed. this is meant to be at least 10x faster (and can be more than 100x faster) than pd.read_csv() or pd.read_table()
  2. Supports partial reading (you can read in specific products, specific cities, stores, etc.) without having to load the entire file
  3. It understands the Kilts/Nielsen directory structure -- so you can just download from Kilts and go.
  4. It saves highly compressed (and fast) .parquet files
  5. Optional support for various fixes/updates issued by Kilts.

These classes are built on pyarrow 10.0.0 Apache Arrow which you will need to install.

Installation

To install with pip simply type:

pip install git+https://github.com/chrisconlon/kiltsnielsen

To install manually from Github (if your pip isn't setup correctly). After cloning, navigate to the correct directory:

git clone https://github.com/chrisconlon/kiltsnielsen.git
sudo python setup.py install

Make sure your python installation has the latest pyarrow (as well as pandas and NumPy)

Using pip:

pip install pyarrow
pip install --upgrade pyarrow

Using pip (requirements.txt):

python -m pip install -r requirements.txt

Using conda:

conda intall pyarrow
conda update pyarrow

Data

Information about the data can be found at the Kilts Center's Website for the Nielsen Dataset.

Check with your institution to gain access to the data. Once you have gained access, download files as follows:

Retail Scanner Data:
  1. Construct file extracts using the Kilts File Selection System. Note you must separately gain access to the File Selection System after applying for the data.
  2. The data are available in .tgz files. Data can be downloaded by group, module, and/or year
  3. Unzip the .tgz files
Consumer Panel Data
  1. Panelist data can be downloaded directly from Globus. The data are small enough for a typical work machine. The data are available in .tgz files
  2. Unzip the .tgz files.

Importantly, make sure all files are unzipped and preserved in the original Nielsen structure before using the methods provided here. (Do not rearrange the directory structure.)

QuickStart

This shows how we process the retail scanner data for Backus Conlon Sinkinson (2021)

import pandas as pd
from  kiltsreader import RetailReader
from pathlib import Path

# Read these from your nielsen directory -- change this
cereal_dir = Path.cwd()

# save your ouptut here -- change this
save_dir = Path.cwd()

# Specify which dmas /modules /columns to keep
dmas = [506, 517, 556, 602, 751]
cereal = 1344

# supermarkets stores only (no superstores/pharmacies/etc)
channels =['F']

# read cereal -- auto-write
rr = RetailReader(cereal_dir)
rr.filter_years(drop=[2006, 2019])
rr.read_stores()
rr.filter_stores(keep_dmas=dmas, keep_channels=channels)
rr.read_products(keep_modules=[cereal])
rr.read_sales()

# we can access the underlying data from the rr object
# add the dma (city) and retailer_code (chain) info directly to sales data
# then save the data in chunks by dma_code (city)
rr.df_sales = pd.merge(rr.df_sales,
    rr.df_stores[['store_code_uc','panel_year','dma_code']],
    on=['store_code_uc','panel_year'])
rr.write_data(save_dir, stub="cereal", as_table=True, separator='dma_code')

Tutorial

from kiltsreader import RetailReader, PanelReader
  1. Locate your Nielsen Retail Scanner and Consumer Panel data separately
  2. Open Example.py
  3. Replace dir_retail and dir_panel with the locations of your Retail Scanner and Consumer Panel Data respectively
  4. Replace the KEEP_GROUPS, KEEP_MODULES, DROP_YEARS, KEEP_YEARS, KEEP_STATES, and KEEP_CHANNEL with your relevant selection
  5. Run Example.py to verify the code works.

Class Descriptions

RetailReader

class RetailReader(dir_read=path.Path.cwd(), verbose=True)

RetailReader defines the class object used to read in the Nielsen Retail Scanner Data (see above)

Parameters:
  • dir_read(pathlib Path object, optional): points to the location of the Retail Scanner Files. Should be named nielsen_extract or something similar, containing the subfolder RMS. Default is the current working directory.
  • verbose(bool): if True, prints updates after processing files. Displays size of files processed. Default is True.
Methods:
  • filter_years(): Selects years for which to process annual scanner data files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.
  • read_products(): Reads in the set of product characteristics (and filters them).
  • read_stores(): Reads in the full stores list.
  • filter_stores(): filter stores by State, DMA, and channel-type. You MUST run read_stores() first.
  • read_rms(): the RMS data contains the upc_ver_uc information for re-used UPCs.
  • read_extra(): Populates RetailReader.df_variations with brand variations data, typically located in MasterFiles/Latest/brand_variations.tsv. Lists brand codes, descriptions, and any alternative descriptions.
  • read_sales(): this reads in the majority of the scanner data.
  • write_data(): after reading in the data, this writes the tables as .parquet files.
Objects:
  • df_products (pandas DataFrame): default empty, stores products data after processing
  • df_sales (pandas DataFrame): default empty, stores sales data after processing
  • df_stores (pandas DataFrame): default empty, stores store data after processing
  • df_rms (pandas DataFrame): default empty, stores RMS versions data after processing
  • df_extra (pandas DataFrame): default empty, stores extra product data after processing
  • all_years (list): list of years for which data will be processed
  • files_product (pathlib Path object): stores name of product characteristic file
  • files_rms (list of pathlib Path objects): stores names of annual rms_versions files
  • files_stores (list of pathlib Path objects): stores names of annual stores files
  • files_extra (list of pathlib Path objects): stores names of annual extra files
  • files_sales (list of pathlib Path objects)stores names of annual sales files
  • all_years (list): list of years included in data. Updates with filtering

Available Functions in the RetailReader Class

Functions also described with docstrings in the NielsenReader.py file

RetailReader.filter_years(keep=None, drop=None): Selects years for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Populates RetailReader.all_years

  • keep(list of integers, optional): list of years to keep, e.g. range(2004, 2013). Can only include years that are already present in the data, e.g. specifying keep=[1999] will result in an empty set of years
  • drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]

RetailReader.filter_sales(keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None): Selects product groups (outer category) and product modules (inner category) for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.

  • keep_groups(list of integers, optional): list of product groups to keep, e.g. keep_groups=[1508, 1048]
  • drop_groups(list of integers, optional): list of product groups to exclude, e.g. drop_groups = [1046]. Takes precedence if there is any overlap with keep_groups
  • keep_groups(list of integers, optional): list of product modules to keep, e.g. keep_modules=[1481, 1482]
  • drop_groups(list of integers, optional): list of product modules to exclude, e.g. drop_groups = [1483]. Takes precedence if there is any overlap with keep_module

RetailReader.read_rms(): Populates RetailReader.df_rms Processes the annual RMS versions files, which map reused UPCs to the appropriate version based on year

RetailReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None): Populates RetailReader.df_products. Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.

Note that the function does NOT carry over the filtered set of groups and modules from RetailReader.filter_sales(). The RetailReader.read_products() function is redundant with the PanelReader.read_products() function, and therefore allows the user to read in the full set of products and their characteristics even while reading only a subset of sales.

  • upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g. upc_list=[002111039080, 009017445929] (leading zeros not required)
  • keep_groups(list of integers, optional): list of product groups to keep, e.g. keep_groups=[1508, 1048]
  • drop_groups(list of integers, optional): list of product groups to exclude, e.g. drop_groups = [1046]. Takes precedence if there is any overlap with keep_groups
  • keep_groups(list of integers, optional): list of product modules to keep, e.g. keep_modules=[1481, 1482]
  • drop_groups(list of integers, optional): list of product modules to exclude, e.g. drop_groups = [1483]. Takes precedence if there is any overlap with keep_module

RetailReader.read_extra(years=None, upc_list=None) Populates RetailReader.df_extra Selects annual Products Extra files for all post-filtering years. Redundant with PanelReader.read_extra() Product group and module filtering are not possible. Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description

  • upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g. upc_list=[002111039080, 009017445929] (leading zeros not required)
  • years(list of integers, optional): selects years for which to. Note that previous year-filtering of the RetailReader object will carry over unless a new set of years is specified.

RetailReader.read_stores(): Populates RetailReader.df_stores Stores files are common to all product groups and modules, so processing will be unaffected by RetailReader.filter_years or RetailReader.filter_sales

RetailReader.filter_stores(keep_dma=None, drop_dma=None, keep_states=None, drop_states=None, keep_channel=None, drop_dma=None): Updates RetailReader.df_stores

  • keep_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to keep, e.g. keep_dma=[801, 503]
  • drop_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to exclude, e.g. drop_dma=[602]. Takes precedence if there is any overlap with keep_dma.
  • keep_states(list of strings, optional): list of states to keep, with list in two-character format, e.g. keep_states=['TX', 'CA']
  • drop_states(list of integers, optional): list of states to exclude, with list in two-character format, e.g. drop_states=['NJ', 'NY'] Takes precedence if there is any overlap with keep_states.
  • keep_channels(list of characters, optional): list of channels (store types) to keep, e.g. keep_channels=['F', 'G']. See Nielsen documentation for explanation of channels and list of options.
  • drop_channels(list of characters, optional): list of channels (store types) to exclude, e.g. drop_channels = ['C']. Takes precedence if there is any overlap with keep_channels. See Nielsen documentation for explanation of channels and list of options.

Note: Must be run AFTER RetailReader.read_stores(). Pre-filtering is not possible.

RetailReader.read_sales(incl_promo = True): This is the main function to read scanner data.

  • Populates RetailReader.df_sales
  • Reads in the weekly, store x upc level sales data, post-filter if any have been applied.
  • Uses pyarrow methods to filter and read the data to minimize memory and time use.
  • Warning: May still require large amounts of memory/CPU.
  • incl_promo(boolean, optional): Setting to False skips the promo and display fields.

RetailReader.write_data(dir_write = path.Path.cwd(), stub = 'out', compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).

  • dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
  • stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
  • compr(str): type of compression used for generating parquets. Default is brotli
  • as_table(bool): whether to write as pyarrow separated row-tables. See Example.py for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time.
  • separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is panel_year. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})
Private Methods

RetailReader.get_module(file_sales=): returns product module (inner category) corresponding to particular sales file.

  • file_sales (pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv

RetailReader.get_group(file_sales=): returns product group (outer category) corresponding to particular sales file.

  • file_sales (pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv

PanelReader

class PanelReader(dir_read=path.Path.cwd(), verbose=True) PanelReader defines the class object used to read in the Nielsen Consumer Panel data.

Parameters:
  • dir_read(pathlib Path object, optional): points to the location of the Consumer Panel data files. Likely named Panel or something similar. Subfolders should be years. Default is the current working directory.
  • verbose(bool): if True, prints updates after processing files. Displayes size of files processed. Default is true.
Methods:
  • filter_years(): Selects years for which to process annual panelist, purchase, trips, and extra files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.
  • read_retailers(): Reads the retailers file, which list retailer codes and channels (and filters them).
  • read_products(): Reads in the set of product characteristics (and filters them).
  • read_extra(): Selects annual Products Extra files for all post-filtering years.
  • read_variations(): Populates PanelReader.df_variations with brand variations data, typically located in MasterFiles/Latest/brand_variations.tsv. Lists brand codes, descriptions, and any alternative descriptions.
  • read_year(): this does most of the work, and reads in panelists, trips, purchases for a single year
  • read_annual(): this is the main function repeatedly calls read_year() to read in panelists, trips, purchases for multiple years
  • write_data(): after reading in the data, this writes the tables as .parquet files
  • read_revised_panelists(): Corrects the Panelist data using errata provided by Nielsen for issues not yet incorporated into the data as of October 2021. Must have already called PanelReader.read_annual(), PanelReader.read_products(), PanelReader.read_variations(), PanelReader.read_retailers()
  • process_open_issues(): Corrects the product extra and panelist data using errata provided by Nielsen for two specific issues:
    • ExtraAttributes_FlavorCode: adds missing flavor code and flavor description to 2010 products extra characteristics file
    • Panelist_maleHeadBirth_femaleHeadBirth: corrects issue with male head of household birth month
Objects:
  • df_products (pandas DataFrame): default empty, stores products data after from Master Files processing
  • df_variations (pandas DataFrame): default empty, stores brand_variations data from Master Files after processing
  • df_retailers (pandas DataFrame): default empty, stores retailers data from Master Files after processing
  • df_trips (pandas DataFrame): default empty, stores annual trips data after processing
  • df_panelists (pandas DataFrame): default empty, stores annual panelists data after processing
  • df_purchases (pandas DataFrame): default empty, stores annual purchases after processing
  • df_extra (pandas DataFrame): default empty, stores annual products extra characteristics data after processing
  • files_annual (list of pathlib Path objects)
  • files_product (list of pathlib Path objects): stores name of unrevised products file
  • files_variation (list of pathlib Path objects): stores name of unrevised brand variations file
  • files_retailers (list of pathlib Path objects): stores name of unrevised retailers file
  • files_trips (list of pathlib Path objects): stores names of annual tripes files
  • files_panelists (list of pathlib Path objects): stores names of annual panelists files
  • files_purchases (list of pathlib Path objects): stores names of annual purchases files
  • files_extra (list of pathlib Path objects): stores names of annual products extra files
  • all_years (list): list of years included in data. Updates with filtering

Available Functions in the RetailReader Class

PanelReader.filter_years(keep=None, drop=None) Selects years for which to process annual panelist, purchase, trips, and extra files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Updates PanelReader.all_years

  • keep(list of integers, optional): list of years to keep, e.g. range(2004, 2013). Can only include years that are already present in the data, e.g. specifying keep=[1999] will result in an empty set of years
  • drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]

PanelReader.read_retailers(): Populates PanelReader.df_retailers Processes the Master retailers file, which list retailer codes and channels.

Note that the file may be later revised following a call to PanelReader.read_revised_panelists() or PanelReader.process_open_issues()

PanelReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None) Populates PanelReader.df_products, which should be identical to RetailReader.df_products following a call to RetailReader.read_products().

Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.

Note that the file may be later revised following a call to PanelReader.read_revised_panelists() or PanelReader.process_open_issues(). Such updating is only possible through PanelReader; the Retail Scanner files do not contain any revisions.

  • upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g. upc_list=[002111039080, 009017445929] (leading zeros not required)
  • keep_groups(list of integers, optional): list of product groups to keep, e.g. keep_groups=[1508, 1048]
  • drop_groups(list of integers, optional): list of product groups to exclude, e.g. drop_groups = [1046]. Takes precedence if there is any overlap with keep_groups
  • keep_groups(list of integers, optional): list of product modules to keep, e.g. keep_modules=[1481, 1482]
  • drop_groups(list of integers, optional): list of product modules to exclude, e.g. drop_groups = [1483]. Takes precedence if there is any overlap with keep_module

PanelReader.read_extra(years=None, upc_list=None) Populates PanelReader.df_extra Selects annual Products Extra files for all post-filtering years. Redundant with RetailReader.read_extra(). Does not have any revisions as of October 2021. Product group and module filtering are not possible; only UPC and year filtering are available. Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description.

  • years (list of integers, optional): selects years for which to. Note that previous year-filtering of the PanelReader object will carry over unless a new set of years is specified
  • upc_list (list of integers, optional): list of Universal Product Codes to keep, e.g. upc_list=[002111039080, 009017445929] (leading zeros not required)

PanelReader.read_variations() Populates PanelReader.df_variations with brand variations data, typically located in MasterFiles/Latest/brand_variations.tsv. Lists brand codes, descriptions, and any alternative descriptions.

PanelReader.read_year(year, keep_dmas=None, drop_dmas=None, keep_states=None, drop_states=None,) Populates PanelReader.df_panelists, PanelReader.df_purchases, and PanelReader.df_trips Processes a single year of annual data (panelists, purchases, and trips data). Useful if you seek to only process one year at a time; otherwise use read_annual as described below

  • year (int): single year to process
  • keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g. keep_dma=[801, 503]
  • drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g. drop_dma=[602]. Takes precedence if there is any overlap with keep_dma.
  • keep_states (list of strings): list of states to keep, with list in two-character format, e.g. keep_states=['TX', 'CA']
  • drop_states (list of strings): list of states to exclude, with list in two-character format, e.g. drop_states=['NJ', 'NY'] Takes precedence if there is any overlap with keep_states.

PanelReader.read_annual(keep_states=None, drop_states=None, keep_dmas=None, drop_dmas=None) Processes all years (post-PanelReader.filter_years()) with repeated calls to PanelReader.read_year()

  • keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g. keep_dma=[801, 503]
  • drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g. drop_dma=[602]. Takes precedence if there is any overlap with keep_dma.
  • keep_states (list of strings): list of states to keep, with list in two-character format, e.g. keep_states=['TX', 'CA']
  • drop_states (list of strings): list of states to exclude, with list in two-character format, e.g. drop_states=['NJ', 'NY'] Takes precedence if there is any overlap with keep_states.

PanelReader.write_data(dir_write = path.Path.cwd(), stub = 'out', compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).

  • dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
  • stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
  • compr(str): type of compression used for generating parquets. Default is brotli
  • as_table(bool): whether to write as pyarrow separated row-tables. See Example.py for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time.
  • separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is panel_year. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})

PanelReader.read_revised_panelists() Updates PanelReader.df_products, PanelReader.df_variations, PanelReader.df_retailers, and PanelReader.df_panelists Corrects the Panelist data using errata provided by Nielsen for issues not yet incorporated into the data as of October 2021. Must have already called PanelReader.read_annual(), PanelReader.read_products(), PanelReader.read_variations(), PanelReader.read_retailers()

PanelReader.process_open_issues() Updates PanelReader.df_panelists and PanelReader.df_extra Corrects the product extra and panelist data using errata provided by Nielsen for two specific issues:

  • ExtraAttributes_FlavorCode: adds missing flavor code and flavor description to 2010 products extra characteristics file
  • Panelist_maleHeadBirth_femaleHeadBirth: corrects issue with male head of household birth month

About

Python utilities for working with Kilts-Nielsen files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages