
getting empty data #169

Open

ihabpalamino opened this issue Jul 12, 2023 · 4 comments

Comments

@ihabpalamino

ihabpalamino commented Jul 12, 2023

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from Scweet.scweet import scrape

# Specify the parameters for scraping
username = "2MInteractive"
since_date = "2023-07-01"
until_date = "2023-07-11"
headless = True

# Set up the ChromeDriver service
service = Service("C:/Users/HP Probook/Downloads/chromedriver.exe")  # Replace with the actual path to chromedriver

# Set up the ChromeOptions
options = webdriver.ChromeOptions()
options.headless = headless  # Note: the headless attribute is deprecated in Selenium 4; prefer options.add_argument("--headless")

# Create the WebDriver
driver = webdriver.Chrome(service=service, options=options)

# Scrape the tweets by username
data = scrape(from_account=username, since=since_date, until=until_date, headless=headless, driver=driver)

# Print the scraped data
print(data)

# Close the WebDriver
driver.quit()

getting empty data. Console output of "C:\Users\HP Probook\PycharmProjects\firstproject\venv\Scripts\python.exe" "C:/Users/HP Probook/PycharmProjects/firstproject/TikTokScrap.py":
looking for tweets between 2023-07-01 and 2023-07-06 ...
path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-06%20since%3A2023-07-01%20&src=typed_query
scroll 1
scroll 2
looking for tweets between 2023-07-06 and 2023-07-11 ...
path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-11%20since%3A2023-07-06%20&src=typed_query
scroll 1
scroll 2
Empty DataFrame
Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL]
Index: []

Process finished with exit code 0
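For context, the two search windows in the log come from scrape() splitting the [since, until] range into interval-day chunks (apparently 5-day chunks in this run). A rough stdlib sketch of that chunking (date_windows is a hypothetical helper, not Scweet's actual implementation):

```python
from datetime import date, timedelta

def date_windows(since, until, interval):
    """Yield (start, end) day windows covering [since, until),
    mirroring the 'looking for tweets between A and B' log lines."""
    start = date.fromisoformat(since)
    stop = date.fromisoformat(until)
    while start < stop:
        end = min(start + timedelta(days=interval), stop)
        yield (start.isoformat(), end.isoformat())
        start = end

# Reproduces the two windows in the log above:
print(list(date_windows("2023-07-01", "2023-07-11", 5)))
# [('2023-07-01', '2023-07-06'), ('2023-07-06', '2023-07-11')]
```

An empty result for every window means the per-card extraction is failing, not the date logic.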

@ihabpalamino ihabpalamino changed the title UnboundLocalError: local variable 'driver' referenced before assignment getting empty data Jul 12, 2023
@baqachadil

baqachadil commented Jul 13, 2023

I have the exact same issue. I can see Selenium paging through tweets for the specified period, but no data is returned. This is my code:

driver = init_driver(headless=False, show_images=False)

log_in(driver, env=".env")

data = scrape(words=['crypto', 'ethereum', 'bitcoin'], hashtag='crypto', since="2023-02-01", until="2023-02-05", from_account=None, interval=1, headless=False, display_type=None, save_images=False, lang="en", resume=False, filter_replies=False, proximity=False, driver=driver)

print(data)

console:
looking for tweets between 2023-2-01 and 2023-02-02 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-02%20since%3A2023-2-01%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
scroll 8
looking for tweets between 2023-02-02 and 2023-02-03 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-03%20since%3A2023-02-02%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
scroll 8
scroll 9
looking for tweets between 2023-02-03 and 2023-02-04 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-04%20since%3A2023-02-03%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
looking for tweets between 2023-02-04 and 2023-02-05 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-05%20since%3A2023-02-04%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
looking for tweets between 2023-02-05 and 2023-02-06 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-06%20since%3A2023-02-05%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
scroll 8
looking for tweets between 2023-02-06 and 2023-02-07 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-07%20since%3A2023-02-06%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
looking for tweets between 2023-02-07 and 2023-02-08 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-08%20since%3A2023-02-07%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
scroll 8
looking for tweets between 2023-02-08 and 2023-02-09 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-09%20since%3A2023-02-08%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
scroll 5
scroll 6
scroll 7
scroll 8
looking for tweets between 2023-02-09 and 2023-02-10 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-10%20since%3A2023-02-09%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
looking for tweets between 2023-02-10 and 2023-02-11 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-11%20since%3A2023-02-10%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-11 and 2023-02-12 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-12%20since%3A2023-02-11%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-12 and 2023-02-13 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-13%20since%3A2023-02-12%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-13 and 2023-02-14 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-14%20since%3A2023-02-13%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-14 and 2023-02-15 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-15%20since%3A2023-02-14%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-15 and 2023-02-16 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-16%20since%3A2023-02-15%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-16 and 2023-02-17 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-17%20since%3A2023-02-16%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-17 and 2023-02-18 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-18%20since%3A2023-02-17%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-18 and 2023-02-19 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-19%20since%3A2023-02-18%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
looking for tweets between 2023-02-19 and 2023-02-20 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-20%20since%3A2023-02-19%20lang%3Aen&src=typed_query
scroll 1
scroll 2
scroll 3
scroll 4
Empty DataFrame
Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL]

@baqachadil

baqachadil commented Jul 18, 2023

I managed to find the problem with this issue. First, go to the function get_data in 'Scweet\utils.py' and change all instances of

find_element_by_xpath('...') to find_element('xpath', '...'), since the find_element_by_* helpers were removed in recent versions of Selenium (4.3 and later).
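As a sanity check, here is the new call style exercised against a stub element so it runs without a browser (StubElement is hypothetical; in real Selenium the string 'xpath' is exactly what By.XPATH resolves to):

```python
class StubElement:
    """Minimal stand-in for a Selenium WebElement."""
    def __init__(self, texts):
        self._texts = texts  # maps xpath -> text

    # Selenium 4 signature: find_element(by, value)
    def find_element(self, by, value):
        if by != 'xpath' or value not in self._texts:
            raise Exception(f'no such element: {value}')
        el = StubElement({})
        el.text = self._texts[value]
        return el

card = StubElement({'.//span': '2MInteractive'})

# Old, removed in Selenium 4.3+:
#   card.find_element_by_xpath('.//span')
# New:
print(card.find_element('xpath', './/span').text)  # 2MInteractive
```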

The second thing is that you have to check that every function that extracts an element from the HTML actually returns something: if even one required element is missing, the whole tweet is treated as null (for example, when Selenium can't find the username of the tweet). To do this, verify that each XPath is still correct. I will give one example, but you should check all of them.

try:
        text = card.find_element('xpath','.//div[2]/div[2]/div[1]').text
except:
        text = ""

should actually be:

try:
        text = card.find_element('xpath','.//div[2]/div[2]/div[2]/div').text
except:
        text = ""

In my case I didn't use all of the tweet metadata; I kept only the fields I needed and checked that their XPaths are correct. Here is the final code of the get_data() method:

from selenium.common.exceptions import NoSuchElementException

def get_data(card, save_images=False, save_dir=None):
    """Extract data from a tweet card; return None when a required field is missing."""
    image_links = []

    # Required fields: skip the whole tweet if any of them is missing.
    try:
        username = card.find_element('xpath', './/span').text
    except NoSuchElementException:
        return None

    try:
        handle = card.find_element('xpath', './/span[contains(text(), "@")]').text
    except NoSuchElementException:
        return None

    try:
        postdate = card.find_element('xpath', './/time').get_attribute('datetime')
    except NoSuchElementException:
        return None

    # Optional fields: fall back to an empty string.
    try:
        text = card.find_element('xpath', './/div[2]/div[2]/div[2]/div').text
    except NoSuchElementException:
        text = ""

    try:
        embedded = card.find_element('xpath', './/div[2]/div[2]/div[2]').text
    except NoSuchElementException:
        embedded = ""

    # Tweet URL (required).
    try:
        element = card.find_element('xpath', './/div/div/div[2]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/a')
        tweet_url = element.get_attribute('href')
    except NoSuchElementException:
        return None

    tweet = (username, handle, postdate, text, embedded, tweet_url)
    return tweet
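The required-versus-optional pattern in get_data() can be exercised without a browser using a stub card (FakeCard and the extract() wrapper below are hypothetical illustrations; the XPaths are only placeholders):

```python
class FakeCard:
    """Stand-in for a Selenium WebElement: find_element returns an object
    with a .text attribute, or raises when the xpath is not present."""
    def __init__(self, fields):
        self._fields = fields  # maps xpath -> text

    def find_element(self, by, value):
        if value not in self._fields:
            raise Exception(f'no such element: {value}')
        class _El:  # minimal element exposing .text
            pass
        el = _El()
        el.text = self._fields[value]
        return el

def extract(card):
    # Required field: bail out (return None) when missing.
    try:
        username = card.find_element('xpath', './/span').text
    except Exception:
        return None
    # Optional field: fall back to an empty string when missing.
    try:
        text = card.find_element('xpath', './/div[2]/div[2]/div[2]/div').text
    except Exception:
        text = ""
    return (username, text)

print(extract(FakeCard({'.//span': 'user1'})))                    # ('user1', '')
print(extract(FakeCard({'.//div[2]/div[2]/div[2]/div': 'hi'})))   # None
```

This is why one wrong XPath on a required field silently empties the whole DataFrame: every card short-circuits to None.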

@ihabpalamino
Author

and does it work?


@wdj1995

wdj1995 commented Sep 12, 2023


Thanks for your great work!
But when I use your code, I found a problem.

When the scraped text is
"reply to @xxxxx
XXXXXXXXXX the embedded text XXXXXXX",

I only get "reply to @xxxxx",
not the real embedded text.
Could you help me?
