Exported file always empty data on Version 3.0.0 #1149

Open
fishfree opened this issue Sep 18, 2023 · 9 comments

Comments

fishfree commented Sep 18, 2023

Actually, I installed SFM UI the Docker way. I installed 3.0.0 to adopt the Twitter 2.0 API, since the free Twitter 1.1 API was terminated.
On the collection page, it shows results:
[screenshot]

After exporting, it shows:
[screenshot]

However, when I open the exported file, it always contains only the headers and no content:
[screenshot]

fishfree commented Sep 19, 2023

@dolsysmith Hi Dolsy, sorry to bother you. I found you upgraded the docker-compose.yml to v3.0.0. Could you please have a look at this issue? And could you also please confirm whether your exporter is working as expected? Thank you very much!

I noticed the docker-compose logs -f output:

twitterrestexporter2_1          | 2023-09-20 05:39:49,867: sfmutils.warc_iter --> Iterating over /sfm-collection-set-data/collection_set/5446e7cd861d4a24a759b1f5bdce278e/767b28d78b164cf7b7ca83257fee81d4/2023/09/17/23/5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz
twitterrestexporter2_1          | 2023-09-20 05:39:49,870: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 0 records. Yielded 0 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,105: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 3 records. Yielded 100 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,271: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 4 records. Yielded 200 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,521: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 5 records. Yielded 300 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,660: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 6 records. Yielded 400 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,792: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 7 records. Yielded 500 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,906: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 8 records. Yielded 600 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,005: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 9 records. Yielded 700 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,064: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 798 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,081: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 800 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,258: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 13 records. Yielded 900 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,475: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 15 records. Yielded 1000 items.
twitterrestexporter2_1          | 2023-09-20 05:39:52,220: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 20 records. Yielded 1448 items.
twitterrestexporter2_1          | 💔 ERROR: 1 Unexpected items in data! 
twitterrestexporter2_1          | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1          | If the object type is correct, add extra columns with:
twitterrestexporter2_1          | --extra-input-columns "public_metrics.bookmark_count"
twitterrestexporter2_1          | Skipping entire batch of 6982 tweets!
twitterrestexporter2_1          | 2023-09-20 05:39:54,192: twarc --> CSV Unexpected Data: "public_metrics.bookmark_count". Expected 83 columns, got 64. Skipping entire batch of 6982 tweets!
twitterrestexporter2_1          | 2023-09-20 05:39:54,259: __main__ --> DataFrame contains 0 rows.

dolsysmith (Contributor) commented:

@fishfree Thanks for posting your docker logs. I am unfortunately unable to test this myself, because I don't currently have a working Twitter v2 API key. But I believe the issue stems from the twarc-csv package, which generates the CSV files from the Twitter JSON. Twitter has been relentlessly tweaking their API schema, and whenever they add or drop a field from the JSON, twarc-csv needs to be updated.

I'm not sure whether the latest version of twarc-csv will handle this, but you could try the following (the full sequence is sketched after the list):

  1. Open a bash session in the exporter container: docker exec -it twitterrestexporter2_1 /bin/bash.
  2. Run pip install --upgrade twarc-csv.
  3. Exit the bash shell and stop, but do not delete, the container: docker stop twitterrestexporter2_1.
  4. Restart the container: docker start twitterrestexporter2_1.
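
Put together, and assuming the container name twitterrestexporter2_1 from the logs above, the whole sequence looks like this:

    # Open a shell in the running exporter container
    docker exec -it twitterrestexporter2_1 /bin/bash

    # Inside the container: upgrade twarc-csv, then leave the shell
    pip install --upgrade twarc-csv
    exit

    # Back on the host: stop (but do not delete) and restart the container
    docker stop twitterrestexporter2_1
    docker start twitterrestexporter2_1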

With luck, when the container restarts, it will use the upgraded version of twarc-csv. If that doesn't work, you might try exporting the full JSON of the Tweets from SFM (since the full JSON export does not rely on twarc-csv) and then using the latest version of twarc-csv outside the containers, at the command line, to convert the JSON to CSV. At the command line, you can also pass twarc-csv the argument suggested by the error in the logs, which should correct for the issue: --extra-input-columns "public_metrics.bookmark_count".
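
For the command-line route, something like this should work (twarc-csv provides the csv subcommand for twarc2; tweets.jsonl is a placeholder for the full JSON export from SFM, one tweet object per line):

    # Install the converter outside the containers
    pip install --upgrade twarc twarc-csv

    # Convert JSON to CSV, declaring the extra column named in the error
    twarc2 csv --extra-input-columns "public_metrics.bookmark_count" tweets.jsonl tweets.csv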

Eventually, I should have time to push a new release of SFM with the latest twarc-csv library in the Docker images. But that probably won't be for another month or so.

In the meantime, I hope that helps!

fishfree commented Sep 21, 2023

@dolsysmith Thank you very much for your tip! I rebuilt the image locally with twarc-csv 0.7.2. It works now.
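
For reference, a Dockerfile along these lines was enough (the base image tag is my assumption; adjust it to the exporter image you deploy):

    FROM gwul/sfm-twitter-rest-exporter-v2:3.0.0
    # Pin the twarc-csv release that matches the current Twitter schema
    RUN pip install --no-cache-dir twarc-csv==0.7.2

Then build with docker build -t myown/sfm-twitter-rest-exporter-v2:3.0.0 . and point docker-compose.yml at the new tag.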

fishfree commented:

@dolsysmith Sorry to bother you again. Now, on a new server, I deployed sfm-docker with the following docker-compose.yml:

version: "2"
services:
    db:
        image: gwul/sfm-ui-db:3.0.0
        environment:
            - POSTGRES_PASSWORD=${SFM_POSTGRES_PASSWORD}
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    mq:
        image: gwul/sfm-rabbitmq:3.0.0
        hostname: mq
        ports:
            # Opens up the ports for RabbitMQ management
            - "${SFM_RABBITMQ_MANAGEMENT_PORT}:15672"
        environment:
            - RABBITMQ_DEFAULT_USER=${SFM_RABBITMQ_USER}
            - RABBITMQ_DEFAULT_PASS=${SFM_RABBITMQ_PASSWORD}
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    # These containers will exit on startup. That's OK.
    data:
        image: gwul/sfm-data:3.0.0
        volumes:
             - ${DATA_VOLUME_MQ}
             - ${DATA_VOLUME_DB}
             - ${DATA_VOLUME_EXPORT}
             - ${DATA_VOLUME_CONTAINERS}
             - ${DATA_VOLUME_COLLECTION_SET}
             # For SFM instances installed on 2.3.0 or earlier
             # - ${DATA_VOLUME_FORMER_COLLECTION_SET}
             # - ${DATA_VOLUME_FORMER_EXPORT}
        environment:
            - TZ
            - SFM_UID
            - SFM_GID
    processingdata:
        image: debian:buster
        command: /bin/true
        volumes:
             - ${PROCESSING_VOLUME}
        environment:
            - TZ
    ui:
        image: gwul/sfm-ui:3.0.0
        ports:
            - "${SFM_PORT}:8080"
        links:
            - db:db
            - mq:mq
        environment:
            - SFM_DEBUG=False
            - SFM_APSCHEDULER_LOG=INFO
            - SFM_UI_LOG=INFO
            # This adds a 5 minute schedule option to speed testing.
            - SFM_FIVE_MINUTE_SCHEDULE=False
            # This adds a 100 item export segment for testing.
            - SFM_HUNDRED_ITEM_SEGMENT=False
            - TZ
            - SFM_SITE_ADMIN_NAME
            - SFM_SITE_ADMIN_EMAIL
            - SFM_SITE_ADMIN_PASSWORD
            - SFM_EMAIL_USER
            - SFM_EMAIL_PASSWORD
            - SFM_EMAIL_FROM
            - SFM_SMTP_HOST
            - SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
            - SFM_HOSTNAME
            - SFM_CONTACT_EMAIL
            - TWITTER_CONSUMER_KEY
            - TWITTER_CONSUMER_SECRET
            - WEIBO_API_KEY
            - WEIBO_API_SECRET
            - TUMBLR_CONSUMER_KEY
            - TUMBLR_CONSUMER_SECRET
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_RABBITMQ_MANAGEMENT_PORT
            - SFM_POSTGRES_PASSWORD
            - SFM_POSTGRES_HOST
            - SFM_POSTGRES_PORT
            # To have some test accounts created.
            - LOAD_FIXTURES=False
            - SFM_REQS=release
            - DATA_VOLUME_THRESHOLD_DB
            - DATA_VOLUME_THRESHOLD_MQ
            - DATA_VOLUME_THRESHOLD_EXPORT
            - DATA_VOLUME_THRESHOLD_CONTAINERS
            - DATA_VOLUME_THRESHOLD_COLLECTION_SET
            - PROCESSING_VOLUME_THRESHOLD
            - DATA_SHARED_USED
            - DATA_SHARED_DIR
            - DATA_THRESHOLD_SHARED
            - SFM_UID
            - SFM_GID
            - SFM_INSTITUTION_NAME
            - SFM_INSTITUTION_LINK
            - SFM_ENABLE_COOKIE_CONSENT
            - SFM_COOKIE_CONSENT_HTML
            - SFM_COOKIE_CONSENT_BUTTON_TEXT
            - SFM_ENABLE_GW_FOOTER
            - SFM_MONITOR_QUEUE_HOUR_INTERVAL
            - SFM_SCAN_FREE_SPACE_HOUR_INTERVAL
            - SFM_WEIBO_SEARCH_OPTION
            - SFM_USE_HTTPS
            - SFM_USE_ELB
            - TWITTER_COLLECTION_TYPES
            # For nginx-proxy
            - VIRTUAL_HOST=${SFM_HOSTNAME}
            - VIRTUAL_PORT=${SFM_PORT}
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
            - processingdata
        # Comment out the volumes section if SFM data is stored on mounted filesystems and DATA_SHARED_USED is False.
        volumes:
            - "${DATA_SHARED_DIR}:/sfm-data-shared"
        restart: always
    uiconsumer:
        image: gwul/sfm-ui-consumer:3.0.0
        links:
            - db:db
            - mq:mq
            - ui:ui
        environment:
            - SFM_DEBUG=False
            - SFM_APSCHEDULER_LOG=INFO
            - SFM_UI_LOG=INFO
            - TZ
            - SFM_SITE_ADMIN_NAME
            - SFM_SITE_ADMIN_EMAIL
            - SFM_SITE_ADMIN_PASSWORD
            - SFM_EMAIL_USER
            - SFM_EMAIL_PASSWORD
            - SFM_EMAIL_FROM
            - SFM_SMTP_HOST
            - SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_POSTGRES_PASSWORD
            - SFM_POSTGRES_HOST
            - SFM_POSTGRES_PORT
            - SFM_REQS=release
            - SFM_UID
            - SFM_GID
            - SFM_USE_HTTPS
        volumes_from:
            - data
            - processingdata
        restart: always
# Twitter
    twitterrestharvester:
        image: gwul/sfm-twitter-rest-harvester:3.0.0
        links:
            - mq:mq
        environment:
            - TZ
            - DEBUG=False
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=release
            - HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
            - SFM_UID
            - SFM_GID
            - PRIORITY_QUEUES=False
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    twitterpriorityrestharvester:
        image: gwul/sfm-twitter-rest-harvester:3.0.0
        links:
            - mq:mq
        environment:
            - TZ
            - DEBUG=False
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=release
            - HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
            - SFM_UID
            - SFM_GID
            - PRIORITY_QUEUES=True
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always

    twitterrestexporter2:
        image: myown/sfm-twitter-rest-exporter-v2:3.0.0
        links:
            - mq:mq
            - ui:api
        environment:
            - TZ
            - DEBUG
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=${TWITTER_REQS}
            - SFM_UID
            - SFM_GID
            - SFM_UPGRADE_REQS=${UPGRADE_REQS}
            - MAX_DATAFRAME_ROWS
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always

# PROCESSING
    # This container will exit on startup. That's OK.
    processing:
        image: gwul/sfm-processing:master
        links:
            - ui:api
        environment:
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data:ro
            - processingdata

Among these images, I built myown/sfm-twitter-rest-exporter-v2:3.0.0 myself, according to my post above. When exporting data, however, docker-compose logs -f shows errors as below:

 | 💔 ERROR: 1 Unexpected items in data! 
twitterrestexporter2_1          | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1          | If the object type is correct, add extra columns with:
twitterrestexporter2_1          | --extra-input-columns "author.public_metrics.like_count"
twitterrestexporter2_1          | Skipping entire batch of 10383 tweets!
twitterrestexporter2_1          | 2023-11-14 17:49:48,748: twarc --> CSV Unexpected Data: "author.public_metrics.like_count". Expected 84 columns, got 58. Skipping entire batch of 10383 tweets!
twitterrestexporter2_1          | 2023-11-14 17:49:48,811: __main__ --> DataFrame contains 0 rows.
twitterrestexporter2_1          | 2023-11-14 17:49:48,812: sfmutils.consumer --> Sending message to sfm_exchange with routing_key export.status.twitter2.twitter_user_timeline_2. The body is: {

fishfree reopened this Nov 16, 2023

dolsysmith commented Nov 16, 2023

Hi @fishfree, my guess is that the Twitter data model has changed again, and that twarc-csv needs another update. Since there hasn't been another release since 0.7.2, I would recommend opening an issue on the twarc-csv repo.

It might be possible to modify the SFM twitter-exporter code to check for these errors and respond accordingly; I'll keep this issue open as a reminder to look at this in a future sprint.

Thanks for letting us know.

fishfree commented Nov 17, 2023

@dolsysmith There is indeed public_metrics.like_count here, rather than the author.public_metrics.like_count in the sfm-docker error log. And it is also public_metrics.like_count in the official Twitter docs. So is it still a problem in our code?
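
For what it's worth, the field paths actually present in the harvested data can be listed from a JSON export with jq (assuming one tweet object per line; tweets.jsonl is a placeholder):

    # Flatten the first tweet's keys into dotted paths and look for like_count
    head -n 1 tweets.jsonl | jq -r 'paths(scalars) | map(tostring) | join(".")' | grep like_count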

dolsysmith (Contributor) commented:

I wouldn't be surprised if Twitter had not updated their official docs. I don't think the API is much of a priority for them right now. So yes, I imagine the API has changed, and that has broken the twarc-csv dataframe_converter code.

fishfree commented:

@dolsysmith Thank you! Then is there any alternative way to export the harvested data as CSV files?

dolsysmith (Contributor) commented:

@fishfree I would approach it in two steps (sketched below):

  1. Use SFM's command-line tools to extract the Tweet JSON from the downloaded WARC files.
  2. Use twarc-csv on the extracted JSON with the command-line parameter --extra-input-columns, which should allow you to specify the column that is causing the error.
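
Roughly, with tool names from the sfm-processing documentation (the collection-set ID is a placeholder, and the exact warc-iterator script for v2 data may differ by SFM version):

    # Open a shell in the processing container (it normally exits on startup)
    docker-compose run --rm processing /bin/bash

    # 1. Find the WARCs for the collection and extract the tweet JSON
    twitter_rest_warc_iter.py $(find_warcs.py <collection-set-id>) > tweets.jsonl

    # 2. Convert with twarc-csv, declaring the column from the error message
    pip install --upgrade twarc-csv
    twarc2 csv --extra-input-columns "author.public_metrics.like_count" tweets.jsonl tweets.csv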
