
Profile scraping error: res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*") #178

Open
Kiru6ik opened this issue Jun 26, 2023 · 8 comments



Kiru6ik commented Jun 26, 2023

When scraping a person who worked at an organization multiple times, this error occurred.
I checked the page structure and it should work, but for some reason it fails.
This part of the code causes the problem:

```python
if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")) > 1: #.find_element(By.CLASS_NAME,"pvs-list")
    descriptions = position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")
    for description in descriptions:
        res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")
        position_title_elem = res[0] if len(res) > 0 else None
        work_times_elem = res[1] if len(res) > 1 else None
        location_elem = res[2] if len(res) > 2 else None
```

It can't find `res` by tag name `a`.
As far as I understand, it tries to find the top part of the job description (title, duration at position, location), and all of this is located under an `a` tag on the web page. @joeyism, do you have any insights on that? Am I referring correctly to the part of the page that this code is trying to analyse?
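One way to harden this lookup would be to use `find_elements` (plural), which returns an empty list instead of raising `NoSuchElementException` when nothing matches. A minimal sketch against the standard Selenium element API; `first_or_none` is a hypothetical helper, not part of linkedin_scraper:

```python
def first_or_none(element, by, value):
    """Return the first matching child element, or None instead of raising.

    find_elements (plural) returns [] when nothing matches, so this never
    throws NoSuchElementException the way find_element (singular) does.
    """
    matches = element.find_elements(by, value)
    return matches[0] if matches else None

# usage sketch:
# anchor = first_or_none(description, By.TAG_NAME, "a")
# res = anchor.find_elements(By.XPATH, "*") if anchor else []
```

With something like that, entries whose `li` contains no `a` tag could be skipped instead of crashing the whole scrape.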

The whole error message:

```
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\lists_check.py", line 23, in <module>
    person.scrape(close_on_complete=False)
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\linkedin_scraper\person.py", line 89, in scrape
    self.scrape_logged_in(close_on_complete=close_on_complete)
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\linkedin_scraper\person.py", line 285, in scrape_logged_in
    self.get_experiences()
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\linkedin_scraper\person.py", line 156, in get_experiences
    res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\selenium\webdriver\remote\webelement.py", line 417, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\selenium\webdriver\remote\webelement.py", line 395, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 346, in execute
    self.error_handler.check_response(response)
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\venv\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"tag name","selector":"a"}
(Session info: chrome=114.0.5735.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
Backtrace:
    GetHandleVerifier [0x0025A813+48355]
    (No symbol) [0x001EC4B1]
    (No symbol) [0x000F5358]
    (No symbol) [0x001209A5]
    (No symbol) [0x00120B3B]
    (No symbol) [0x00119AE1]
    (No symbol) [0x0013A784]
    (No symbol) [0x00119A36]
    (No symbol) [0x0013AA94]
    (No symbol) [0x0014C922]
    (No symbol) [0x0013A536]
    (No symbol) [0x001182DC]
    (No symbol) [0x001193DD]
    GetHandleVerifier [0x004BAABD+2539405]
    GetHandleVerifier [0x004FA78F+2800735]
    GetHandleVerifier [0x004F456C+2775612]
    GetHandleVerifier [0x002E51E0+616112]
    (No symbol) [0x001F5F8C]
    (No symbol) [0x001F2328]
    (No symbol) [0x001F240B]
    (No symbol) [0x001E4FF7]
    BaseThreadInitThunk [0x762B0099+25]
    RtlGetAppContainerNamedObjectPath [0x77A97B6E+286]
    RtlGetAppContainerNamedObjectPath [0x77A97B3E+238]
    (No symbol) [0x00000000]

Process finished with exit code 1
```


joeyism commented Jun 26, 2023

Can you provide the code that you've used please?


Kiru6ik commented Jun 26, 2023

Sorry, I forgot to include the failing account in the first place. This bug occurred on this profile: https://www.linkedin.com/in/sheanahamill/.
The error occurs with any basic person scrape. This is the code I used to discover the bug:

```python
from selenium.common.exceptions import WebDriverException
from selenium import webdriver
from linkedin_scraper import Person, actions, Company
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time, pickle
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("user-data-dir=C:\\Users\\User\\AppData\\Local\\Google\\Chrome\\User Data\\Profile 3")

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

person = Person("https://www.linkedin.com/in/sheanahamill", driver=driver, scrape=False)
time.sleep(3)
person.scrape(close_on_complete=False)

name = person.name
title = person.job_title
now_company = person.company
print(name, title, now_company)
experience = person.experiences
print(experience)
current_company = experience[0]
print(current_company)
link_to_company = current_company.linkedin_url
print(link_to_company)
location = current_company.location
print(location)

company = Company(link_to_company, driver=driver, get_employees=False, close_on_complete=False)

company_name = company.name
company_size = company.company_size
company_website = company.website
about = company.about_us
print(company_name, company_size, company_website, about)
```

This code works fine with other accounts (other than the login problem from #173).


khamamoto6 commented Jun 28, 2023

Hey - I only updated the two functions I needed: get_experiences() and get_name_and_location(). In addition to the UI updates, I also fixed the scraper issue where it gets confused when a person has held multiple positions at the same company over time.

You can scrape selectively like this:

```python
person = Person("https://www.linkedin.com/in/sheanahamill", driver=driver, scrape=False)
person.get_experiences()
print(person.experiences)
```

```python
# Replacement methods for linkedin_scraper/person.py. They rely on that
# module's existing imports (os, By, Experience); `colored` (e.g. from
# termcolor) is only used for the debug prints.

def get_name_and_location(self):
        main = self.wait_for_element_to_load(by=By.TAG_NAME, name="main")
        top_panels = main.find_elements(By.CLASS_NAME,"pv-text-details__left-panel")
        self.name = top_panels[0].find_elements(By.XPATH,"*")[0].text
        self.location = top_panels[1].find_element(By.TAG_NAME,"span").text

def get_experiences(self): # modified
        url = os.path.join(self.linkedin_url, "details/experience")
        self.driver.get(url)
        self.focus()
        main = self.wait_for_element_to_load(by=By.TAG_NAME, name="main")
        self.scroll_to_half()
        self.scroll_to_bottom()
        main_list = self.wait_for_element_to_load(name="pvs-list", base=main)
        for position in main_list.find_elements(By.XPATH,"li"):
            position = position.find_element(By.CLASS_NAME,"pvs-entity")
            company_logo_elem, position_details = position.find_elements(By.XPATH,"*")

            # company elem
            company_linkedin_url = company_logo_elem.find_element(By.XPATH,"*").get_attribute("href")

            # position details
            position_details_list = position_details.find_elements(By.XPATH,"*")
            position_summary_details = position_details_list[0] if len(position_details_list) > 0 else None
            position_summary_text = position_details_list[1] if len(position_details_list) > 1 else None # skills OR list of positions
            outer_positions = position_summary_details.find_element(By.XPATH,"*").find_elements(By.XPATH,"*")

            if len(outer_positions) == 4:
                position_title = outer_positions[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].text
                company = outer_positions[1].find_element(By.TAG_NAME,"span").text
                work_times = outer_positions[2].find_element(By.TAG_NAME,"span").text
                location = outer_positions[3].find_element(By.TAG_NAME,"span").text
            elif len(outer_positions) == 3:
                if "·" in outer_positions[2].text:
                    position_title = outer_positions[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].text
                    company = outer_positions[1].find_element(By.TAG_NAME,"span").text
                    work_times = outer_positions[2].find_element(By.TAG_NAME,"span").text
                    location = ""
                else:
                    position_title = ""
                    company = outer_positions[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].text
                    work_times = outer_positions[1].find_element(By.TAG_NAME,"span").text
                    location = outer_positions[2].find_element(By.TAG_NAME,"span").text

            elif len(outer_positions) == 2: # this is for when person has multiple pos over time at one company
                company_div, work_times_div = outer_positions
                company = company_div.find_element(By.TAG_NAME,"span").text
                company_linkedin_url = ""
                print(colored(company, 'yellow'))

                positions_list = position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list")

                for position in positions_list.find_elements(By.XPATH,"*"):
                    print(colored('count position', "yellow"))
                    position = position.find_element(By.CLASS_NAME,"pvs-entity")
                    position_details_list = position.find_elements(By.XPATH,"*")[1].find_elements(By.XPATH,"*")

                    position_summary_details = position_details_list[0] if len(position_details_list) > 0 else None
                    position_summary_text = position_details_list[1] if len(position_details_list) > 1 else None # skills OR list of positions
                    outer_positions = position_summary_details.find_element(By.XPATH,"*").find_elements(By.XPATH,"*")

                    if len(outer_positions) == 3:
                        position_title = outer_positions[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].find_elements(By.XPATH,"*")[0].text
                        print(colored(position_title, 'yellow'))
                        work_times = outer_positions[1].find_element(By.TAG_NAME,"span").text
                        location = outer_positions[2].find_element(By.TAG_NAME,"span").text
                    else:
                        print('need fix.')

                    if 'work_times' not in locals() and 'work_times' not in globals():
                        work_times = None # modified
                    times = work_times.split("·")[0].strip() if work_times else ""
                    duration = work_times.split("·")[1].strip() if times != "" and len(work_times.split("·")) > 1 else None # modified

                    from_date = " ".join(times.split(" ")[:2]) if times else ""
                    to_date = " ".join(times.split(" ")[3:]) if times else ""

                    if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")) > 1:
                        descriptions = position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")
                        for description in descriptions:
                            res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")
                            position_title_elem = res[0] if len(res) > 0 else None
                            work_times_elem = res[1] if len(res) > 1 else None
                            location_elem = res[2] if len(res) > 2 else None

                            location = location_elem.find_element(By.XPATH,"*").text if location_elem else None
                            position_title = position_title_elem.find_element(By.XPATH,"*").find_element(By.TAG_NAME,"*").text if position_title_elem else ""
                            work_times = work_times_elem.find_element(By.XPATH,"*").text if work_times_elem else ""
                            times = work_times.split("·")[0].strip() if work_times else ""
                            duration = work_times.split("·")[1].strip() if len(work_times.split("·")) > 1 else None
                            from_date = " ".join(times.split(" ")[:2]) if times else ""
                            to_date = " ".join(times.split(" ")[3:]) if times else ""

                            experience = Experience(
                                position_title=position_title,
                                from_date=from_date,
                                to_date=to_date,
                                duration=duration,
                                location=location,
                                description=description,
                                institution_name=company if 'company' in locals() or 'company' in globals() else "Not provided", #modified
                                linkedin_url=company_linkedin_url
                            )
                            self.add_experience(experience)
                    else:
                        description = position_summary_text.text if position_summary_text else ""

                        experience = Experience(
                            position_title=position_title,
                            from_date=from_date,
                            to_date=to_date,
                            duration=duration,
                            location=location,
                            description=description,
                            institution_name=company,
                            linkedin_url=company_linkedin_url
                        )
                        self.add_experience(experience)
                return

            if 'work_times' not in locals() and 'work_times' not in globals():
                work_times = None
            times = work_times.split("·")[0].strip() if work_times else ""
            duration = work_times.split("·")[1].strip() if times != "" and len(work_times.split("·")) > 1 else None

            from_date = " ".join(times.split(" ")[:2]) if times else ""
            to_date = " ".join(times.split(" ")[3:]) if times else ""

            if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")) > 1:
                descriptions = position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")
                for description in descriptions:
                    res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")
                    position_title_elem = res[0] if len(res) > 0 else None
                    work_times_elem = res[1] if len(res) > 1 else None
                    location_elem = res[2] if len(res) > 2 else None

                    location = location_elem.find_element(By.XPATH,"*").text if location_elem else None
                    position_title = position_title_elem.find_element(By.XPATH,"*").find_element(By.TAG_NAME,"*").text if position_title_elem else ""
                    work_times = work_times_elem.find_element(By.XPATH,"*").text if work_times_elem else ""
                    times = work_times.split("·")[0].strip() if work_times else ""
                    duration = work_times.split("·")[1].strip() if len(work_times.split("·")) > 1 else None
                    from_date = " ".join(times.split(" ")[:2]) if times else ""
                    to_date = " ".join(times.split(" ")[3:]) if times else ""

                    experience = Experience(
                        position_title=position_title,
                        from_date=from_date,
                        to_date=to_date,
                        duration=duration,
                        location=location,
                        description=description,
                        institution_name=company if 'company' in locals() or 'company' in globals() else "Not provided",
                        linkedin_url=company_linkedin_url
                    )
                    self.add_experience(experience)
            else:
                description = position_summary_text.text if position_summary_text else ""

                experience = Experience(
                    position_title=position_title,
                    from_date=from_date,
                    to_date=to_date,
                    duration=duration,
                    location=location,
                    description=description,
                    institution_name=company,
                    linkedin_url=company_linkedin_url
                )
                self.add_experience(experience)
```

This is from about a week ago, so hopefully it is still working.
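Incidentally, the `work_times` string-splitting in the snippet above can be checked in isolation. A standalone sketch of the same split logic (the sample string is invented for illustration):

```python
def parse_work_times(work_times):
    """Split a string like "Jan 2020 - Mar 2021 · 1 yr 3 mos" into
    (from_date, to_date, duration), mirroring the splits used above."""
    if not work_times:
        return "", "", None
    parts = work_times.split("·")
    times = parts[0].strip()                      # "Jan 2020 - Mar 2021"
    duration = parts[1].strip() if len(parts) > 1 else None
    words = times.split(" ")                      # ["Jan","2020","-","Mar","2021"]
    return " ".join(words[:2]), " ".join(words[3:]), duration

print(parse_work_times("Jan 2020 - Mar 2021 · 1 yr 3 mos"))
# → ('Jan 2020', 'Mar 2021', '1 yr 3 mos')
```

Note this assumes the "Mon YYYY - Mon YYYY" shape; strings like "Present" or localized dates would need extra handling.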


Kiru6ik commented Jul 4, 2023

Still facing the same issue, even with this update.


joeyism commented Jul 4, 2023

I just deployed a fix. Please try v2.11.2.


Kiru6ik commented Jul 4, 2023

Thanks, it works. I tested it on two profiles but haven't tested at scale yet.
I'm new to git and don't know how to submit a PR, but company.py doesn't work either.
The updates needed are:

  1. Line 210: change the class name to mb6: `grid = driver.find_element(By.CLASS_NAME, "mb6")  # used to be artdeco-card.p5.mb4`
  2. Line 241: change the class name to mb1: `grid = driver.find_element(By.CLASS_NAME, "mb1")  # used to be mt1`

And now it works for me.
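Since LinkedIn seems to rename these utility classes regularly, trying a list of candidate class names could soften the next breakage. A hedged sketch; `find_with_fallback` is a hypothetical helper, and the class names are just the ones mentioned in this thread:

```python
def find_with_fallback(driver, class_names):
    """Try each candidate class name in order; return the first element
    found, or None if none of the candidates exist on the page."""
    for name in class_names:
        # By.CLASS_NAME is the locator string "class name" in Selenium
        matches = driver.find_elements("class name", name)
        if matches:
            return matches[0]
    return None

# usage sketch:
# grid = find_with_fallback(driver, ["mb6", "artdeco-card.p5.mb4"])
```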

@arpit5292

> Thanks, it works. I tested it on two profiles but haven't tested at scale yet. I'm new to git and don't know how to submit a PR, but company.py doesn't work either. The updates needed are:
>
>   1. Line 210: change the class name to mb6: `grid = driver.find_element(By.CLASS_NAME, "mb6")  # used to be artdeco-card.p5.mb4`
>   2. Line 241: change the class name to mb1: `grid = driver.find_element(By.CLASS_NAME, "mb1")  # used to be mt1`
>
> And now it works for me.

Hi, it is not working for me. I changed the properties as shown; I was checking with https://www.linkedin.com/company/google.


Kiru6ik commented Aug 16, 2023

The way I troubleshot it:

  1. Identify the part of the scraping that is failing
  2. Read the error
  3. Work out what that code is doing and what its function is
  4. Find the page block this part is trying to locate (this can be challenging, as it is sometimes unclear)
  5. Find the new element name, etc.

You can send the full error message and I can try to help.
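To make finding the renamed block easier, it can help to dump the live DOM to a file and search it offline in an editor. A small sketch assuming the standard Selenium `driver.page_source` attribute; `dump_page` is a hypothetical helper:

```python
def dump_page(driver, path="page_dump.html"):
    """Write the browser's current DOM to a file so the (possibly renamed)
    class names can be inspected after the scraper fails."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    return path
```

Calling this right before the failing `find_element` shows what the page actually contained at that moment, which is often different from what the browser dev tools show on a fresh load.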
