Question: download zip files #75

Closed
ghost opened this issue Jul 6, 2016 · 10 comments

Comments

@ghost

ghost commented Jul 6, 2016

For the zip files packed on 2016/06/01, I downloaded the file named "2004q1/drug-event-0001-of-0002.json.zip" and found that the first three records in the file have unexpected receivedate values:
"receivedate": "20100729",
"receivedate": "20101129",
"receivedate": "20110614".

Are the records assigned to those zip files randomly, or according to some rule?

@HansNelsen
Contributor

The drug event downloads are partitioned on an internal key called @timestamp, which is simply mapped to the drug event key receiptdate. The records should contain all data with a receiptdate between 20040101 and 20040401.
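
A minimal sketch of how a receiptdate could be mapped to a quarterly partition label such as the "2004q1" directory name mentioned above. The function name `quarter_partition` is hypothetical, and the sketch assumes plain calendar quarters; the reply does not say whether the upper bound 20040401 is inclusive, so the real pipeline may treat boundaries differently.

```python
from datetime import datetime


def quarter_partition(receiptdate: str) -> str:
    """Map a CCYYMMDD receiptdate to a label such as '2004q1'.

    Assumption: partitions follow calendar quarters.
    """
    d = datetime.strptime(receiptdate, "%Y%m%d")
    quarter = (d.month - 1) // 3 + 1
    return f"{d.year}q{quarter}"


# The partition depends only on receiptdate, so a record in the 2004q1
# file can still carry a receivedate from a different year.
label = quarter_partition("20040215")
```

Because the key is receiptdate rather than receivedate, the receivedate values in a given file are not constrained to that file's quarter.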

@ghost ghost closed this as completed Jul 6, 2016
@ghost
Author

ghost commented Jul 13, 2016

Follow-up questions:

Do you keep or remove earlier reports with the same safetyreportid?
Or do you combine all reports with the same safetyreportid?

Thanks

@HansNelsen
Contributor

The reports are processed in order from earliest to latest, and only the latest one is kept.
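
The "last one wins" behaviour described here can be sketched as follows. This is an illustration only, not the pipeline's actual code: `keep_latest` and the `version` field are hypothetical, and the input is assumed to be already sorted from earliest to latest.

```python
def keep_latest(reports):
    """Return one report per safetyreportid, keeping the last one seen.

    Assumption: `reports` is already ordered from earliest to latest.
    """
    latest = {}
    for report in reports:
        # A later report with the same id overwrites the earlier one.
        latest[report["safetyreportid"]] = report
    return list(latest.values())


reports = [
    {"safetyreportid": "A", "version": 1},  # "version" is a made-up field
    {"safetyreportid": "B", "version": 1},
    {"safetyreportid": "A", "version": 2},
]
kept = keep_latest(reports)
```

Under this scheme earlier reports are dropped rather than merged, so any field present only in an earlier version would not survive.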

@ghost
Author

ghost commented Jul 22, 2016

https://api.fda.gov/drug/event.json?search=receivedate:[20040101+TO+20140131]+AND+safetyreportid:4261828
has 1 match with
"receivedate": "20040102",
"receiptdate": "20031222"

In the original ASCII 2004Q1, the id has only one record with
FDA_DT=20140103
MFR_DT=20031222

Does receiptdate also come from MFR_DT, or from the newer FDA_DT?

My original guess was that this is caused by a change in the data structure:
In the 2004 q1 file, the 7th column is MFR_DT and the 8th column is FDA_DT.
In the 2014 q1 file, the 7th column is init_fda_dt and the 8th column is fda_dt.
However, init_fda_dt is not the newest report date.
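
One way to guard against this kind of column shift when reading the ASCII files is to look fields up by header name instead of by position. The sketch below assumes the files are "$"-delimited with a header row; the `ID` column, the sample rows, and the function name `read_fda_dt` are made up for illustration.

```python
import csv
import io


def read_fda_dt(text: str, delimiter: str = "$"):
    """Yield the FDA_DT / fda_dt value of each row, looked up by name.

    Assumption: the first line is a header row and fields are
    separated by `delimiter`.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        # Normalise header case so 'FDA_DT' and 'fda_dt' resolve alike.
        by_name = {name.lower(): value for name, value in row.items()}
        yield by_name["fda_dt"]


# Hypothetical, shortened layouts; real files have many more columns.
old_layout = "ID$MFR_DT$FDA_DT\n1$20031222$20040102\n"
new_layout = "ID$init_fda_dt$fda_dt\n2$20140101$20140115\n"

old_values = list(read_fda_dt(old_layout))
new_values = list(read_fda_dt(new_layout))
```

With name-based lookup, a column that changes meaning at a fixed position (MFR_DT versus init_fda_dt) cannot silently be read as the wrong field.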

@HansNelsen
Contributor

I do not know the answer to this one. It might be in the PDF files that come with the downloads; they include a sort of data dictionary that may prove useful in answering this question. You could also ask the openFDA team, since they are in regular contact with the FDA and can forward your question to the internal group responsible for the drug event data.

Good luck.

@dkrylovsb
Collaborator

@yunstat, the pipeline pulls drug adverse events from the FAERS XML/SGML files, not the ASCII ones (the latter are only used for report id to case number conversion in some cases). So in the example you provided, the dates come from AERS_SGML_2004q1.zip/sgml/ADR04M01.SGM:

   <receivedateformat>102</receivedateformat>
   <receivedate>20040102</receivedate>
   <receiptdateformat>102</receiptdateformat>
   <receiptdate>20031222</receiptdate>

So the drug event data should not be affected by the change in the ASCII structure you described.
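
For reference, a minimal sketch of reading the two dates out of a fragment like the one above. It wraps the four elements in a `<safetyreport>` root purely so the standard XML parser accepts them, and it assumes format code 102 means CCYYMMDD; real .SGM files may not be well-formed XML, so this is not a parser for the actual downloads. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# The four elements quoted above, wrapped in a root element for parsing.
FRAGMENT = """
<safetyreport>
  <receivedateformat>102</receivedateformat>
  <receivedate>20040102</receivedate>
  <receiptdateformat>102</receiptdateformat>
  <receiptdate>20031222</receiptdate>
</safetyreport>
"""


def parse_e2b_date(value: str, fmt: str) -> date:
    """Parse a date whose format code is 102 (assumed CCYYMMDD)."""
    if fmt != "102":
        raise ValueError(f"unsupported date format code: {fmt}")
    return datetime.strptime(value, "%Y%m%d").date()


def report_dates(xml_text: str) -> dict:
    """Return receivedate and receiptdate from one report element."""
    root = ET.fromstring(xml_text.strip())
    return {
        field: parse_e2b_date(root.findtext(field),
                              root.findtext(field + "format"))
        for field in ("receivedate", "receiptdate")
    }


dates = report_dates(FRAGMENT)
```

For this record the parsed receiptdate falls before the parsed receivedate, which matches what the API returned.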

@ghost
Author

ghost commented Jul 25, 2016

Thank you. After seeing the earlier response, I checked the NTS files to find out why receiptdate can be smaller than receivedate.

The FAERS system started in 2012 Q4. Before that, the system was called AERS, or LAERS.
From 2012Q4 to 2016Q1, the NTS file gives these definitions:

  • A.1.6b receivedate = Date report was first received by FDA (Initial FDA Received Date)
  • A.1.7b receiptdate = Date of most recent report received by FDA

From 2004Q1 to 2012Q3, the NTS file gives these definitions:

  • A.1.6b receivedate = FDA receive date
  • A.1.7b receiptdate = Manufacturer's date of receipt of initial information. For non-mfr reports, receiptdate is repeated.

Therefore, receivedate >= receiptdate up to and including 2012Q3, and receivedate <= receiptdate from 2012Q4 onward. This answers my question about receiptdate being smaller than receivedate in 2004.

I have been thinking about the effect on aggregated counts. It acts like a shift of the time frame. From what I have observed, the effect is small.

Since openFDA already uses the ASCII files, there are date fields in them, such as FDA_DT, that might be useful for anchoring the reports.
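
The two sets of NTS definitions quoted above imply a different ordering of the two dates depending on the quarter of the source file. A small sketch of that check, with hypothetical function names and assuming the quarter label of the source file is known:

```python
def parse_quarter(label: str) -> tuple:
    """Turn a label such as '2012Q3' or '2004q1' into (2012, 3)."""
    year, quarter = label.lower().split("q")
    return int(year), int(quarter)


def expected_relation(quarter: str) -> str:
    """Return the expected relation of receivedate to receiptdate.

    Up to and including 2012Q3: receivedate >= receiptdate.
    From 2012Q4 onward:        receivedate <= receiptdate.
    """
    return ">=" if parse_quarter(quarter) <= (2012, 3) else "<="


def is_consistent(quarter: str, receivedate: str, receiptdate: str) -> bool:
    """Check one record against the ordering expected for its quarter."""
    # CCYYMMDD strings of equal length compare correctly as plain strings.
    if expected_relation(quarter) == ">=":
        return receivedate >= receiptdate
    return receivedate <= receiptdate


# The 2004Q1 record discussed earlier in this thread.
ok_2004 = is_consistent("2004q1", "20040102", "20031222")
```

The same pair of dates would be flagged as inconsistent if it appeared in a file from 2012Q4 or later.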

@ghost
Author

ghost commented Jul 25, 2016

The format of the FAERS XML files follows the DTD in the ICH E2b/M2 version 2.1 standard.
[http://estri.ich.org/e2br22/index.htm]
In the document "Electronic Transmission of Individual Case Safety Reports Message Specification
Document Version 2.3 November 9, 2000", the definitions are:

  • A.1.6b receivedate = Date report was first received from source
  • A.1.7b receiptdate = Date of receipt of the most recent information for this report

By these definitions, I would expect receiptdate >= receivedate.
On page 22, the same document provides an example:

<receivedateformat>102</receivedateformat>
<receivedate>19980102</receivedate>
<receiptdateformat>102</receiptdateformat>
<receiptdate>19970103</receiptdate>

This example has receiptdate smaller than receivedate.

The document "MAINTENANCE OF THE ICH GUIDELINE ON CLINICAL SAFETY DATA MANAGEMENT: DATA ELEMENTS FOR TRANSMISSION OF INDIVIDUAL CASE SAFETY REPORTS E2B(R2)" provides some more details:

A.1.6 Date report was first received from source
User Guidance:
For senders dealing with initial information, this should always be the date the information was received from the primary source. When retransmitting information received from another regulatory agency or another company or any other secondary source, A.1.6 is the date the retransmitter first received the information. A full precision date should be used (i.e., day, month, year).

A.1.7 Date of receipt of the most recent information for this report
User Guidance:
Because reports are sent at different times to multiple receivers, the initial/follow up status is dependent upon the receiver. For this reason an item to capture follow-up status is not included. However, the date of receipt of the most recent information taken together with the “sender identifier” (A.3.1.2) and “sender’s (case) report unique identifier” (A.1.0.1) provide a mechanism for each receiver to identify whether the report being transmitted is an initial or follow-up report. For this reason these items are considered critical for each transmission. A full precision date should be used (i.e., day, month, year).

The AERS system started well before 2001, and the change in the definitions of the XML keys may have its own historical reasons. Even now, I am still not sure that receivedate in FAERS can be defined as "the date the retransmitter first received the information". This retransmitter-first-received date may not always be available; for example, the retransmitter may not have the date on record, or the field may have a lot of missing data. In many situations, the currently used receivedate is still a reasonably good and quick solution for monitoring the trend.
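
The mechanism described in the A.1.7 user guidance quoted above, where a receiver combines the sender identifier (A.3.1.2), the sender's case report identifier (A.1.0.1), and the most recent receipt date, might be sketched as follows. The function name, the in-memory `seen` dictionary, and the third "not-newer" outcome are assumptions for illustration; the guideline itself only speaks of initial and follow-up reports.

```python
def classify(seen: dict, sender_id: str, case_id: str,
             receiptdate: str) -> str:
    """Classify a transmission from the receiver's point of view.

    `seen` maps (sender_id, case_id) to the latest receiptdate
    (CCYYMMDD string) this receiver has recorded for that case.
    """
    key = (sender_id, case_id)
    previous = seen.get(key)
    if previous is None:
        seen[key] = receiptdate
        return "initial"
    if receiptdate > previous:
        seen[key] = receiptdate
        return "follow-up"
    # Assumption: equal or older dates carry no new information.
    return "not-newer"


seen = {}
first = classify(seen, "SENDER1", "CASE42", "20040102")
second = classify(seen, "SENDER1", "CASE42", "20040215")
third = classify(seen, "SENDER1", "CASE42", "20040215")
```

The state lives with the receiver, which is why the same transmission can be an initial report for one receiver and a follow-up for another.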

@cerdman
Contributor

cerdman commented Jul 25, 2016

Note that E2B(R3) is currently in use and not v2 - http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM275638.pdf

@pozz82

pozz82 commented Jul 27, 2016

@cerdman I just downloaded the 2016q1 XML zip file, and both the PDF and the XML files state that version 2.1 is used, not v3.

PDF:

is compliant with the DTD DCL files that are published as part of the ICH
E2b/M2 version 2.1 standard

XML:

2.1

This issue was closed.

4 participants