Extract text from buttons for semantic elements #573

Open
zirkelc opened this issue Apr 23, 2024 · 1 comment
Labels
question Further information is requested

zirkelc commented Apr 23, 2024

Hi,

I hope I'm not flooding you with too many issues 😄

I have an FAQ page that uses <button /> to show and hide the corresponding content. It looks roughly like this:

<div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question">
	<button is="toggle-button" class="collapsible-toggle text--strong"
		aria-controls="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225" aria-expanded="true"
		itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span>
	</button>

	<collapsible-content id="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225"
		class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
		style="overflow: visible;" open="">
		<div class="collapsible__content text-container" itemprop="text">
			<p>1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
				labore et dolore magna aliqua</p>
		</div>
	</collapsible-content>
</div>

The <button /> contains the question and the <collapsible-content/> contains the answer. However, when I extract the content, the text from the button is not extracted. That usually makes sense because buttons are not really semantic elements, but in this case the buttons contain valuable information.

Here's the full example faq.html:

<div class="faq">
	<div class="faq-navigation hidden-pocket">

	</div>
	<div class="faq__wrapper" itemscope="" itemtype="https://schema.org/FAQPage">
		<h2 id="category-template--18681112592652__faq-6c8e6357-90c2-4ba6-8bbb-95ce71242bae"
			class="faq__category heading h6 anchor">FAQs</h2>
		<div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question">
			<button is="toggle-button" class="collapsible-toggle text--strong"
				aria-controls="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225" aria-expanded="true"
				itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span>
			</button>

			<collapsible-content id="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225"
				class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
				style="overflow: visible;" open="">
				<div class="collapsible__content text-container" itemprop="text">
					<p>1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
						labore et dolore magna aliqua</p>
				</div>
			</collapsible-content>
		</div>
		<div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question">
			<button is="toggle-button" class="collapsible-toggle text--strong"
				aria-controls="block-template--18681112592652__faq-38d0d4d0-5a0b-4ba0-a733-8326f46abeef" aria-expanded="false"
				itemprop="name">2. Question: Dolor sit amet<span class="animated-plus"></span>
			</button>

			<collapsible-content id="block-template--18681112592652__faq-38d0d4d0-5a0b-4ba0-a733-8326f46abeef"
				class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
				style="overflow: hidden;">
				<div class="collapsible__content text-container" itemprop="text">
					<p>2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
						labore et dolore magna
						aliqua</p>
				</div>
			</collapsible-content>
		</div>
		<div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question">
			<button is="toggle-button" class="collapsible-toggle text--strong"
				aria-controls="block-template--18681112592652__faq-04c7cdf4-44b6-454c-9f88-8ecab2ab380a" aria-expanded="false"
				itemprop="name">3. Question: Dolor sit amet<span class="animated-plus"></span>
			</button>

			<collapsible-content id="block-template--18681112592652__faq-04c7cdf4-44b6-454c-9f88-8ecab2ab380a"
				class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">
				<div class="collapsible__content text-container" itemprop="text">
					<p>3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
						labore et dolore magna
						aliqua</p>
				</div>
			</collapsible-content>
		</div>
	</div>
</div>

Here is how I run it:

cat faq.html | trafilatura --formatting --links

The received result:

1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

There also seems to be a small bug: the first and third answers are extracted, but the second answer is omitted:
2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

That's the result I would like to get:

## FAQs
1. Question: Dolor sit amet
1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
2. Question: Dolor sit amet
2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
3. Question: Dolor sit amet
3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

The FAQ elements contain semantic information from schema.org, like itemtype="https://schema.org/Question" and other attributes like itemprop and itemscope. Maybe it is possible to implement a rule that keeps these semantic elements in the result?
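
To illustrate what such a rule could look like, here is a minimal sketch that uses only Python's standard library. It is not how trafilatura works internally; the class name FaqMicrodataParser and the choice to keep only itemprop="name" and itemprop="text" are assumptions made for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FaqMicrodataParser(HTMLParser):
    """Hypothetical helper: collect the text of itemprop="name" and
    itemprop="text" elements in document order, whatever their tag name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []      # list of (itemprop, text) tuples
        self._prop = None    # itemprop value currently being captured
        self._depth = 0      # nesting depth inside the captured element
        self._chunks = []    # raw text pieces of the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._prop is not None:
            # Already capturing: only track nesting of child elements.
            self._depth += 1
            return
        prop = dict(attrs).get("itemprop")
        if prop in ("name", "text"):
            self._prop, self._depth, self._chunks = prop, 1, []

    def handle_endtag(self, tag):
        if self._prop is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace, as the source wraps text over lines.
            text = " ".join("".join(self._chunks).split())
            self.items.append((self._prop, text))
            self._prop = None

    def handle_data(self, data):
        if self._prop is not None:
            self._chunks.append(data)


sample = """
<div itemscope itemprop="mainEntity" itemtype="https://schema.org/Question">
  <button itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span>
  </button>
  <collapsible-content itemscope itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">
    <div itemprop="text"><p>1. Answer: Lorem ipsum dolor sit amet</p></div>
  </collapsible-content>
</div>
"""

parser = FaqMicrodataParser()
parser.feed(sample)
for prop, text in parser.items:
    print(text)
```

The point of the sketch is that the decision to keep an element is based on its itemprop attribute rather than on its tag name, so the question text survives even though it sits inside a button.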

adbar added the question label (Further information is requested) on Apr 24, 2024
adbar (Owner) commented Apr 24, 2024

Hi, thanks for the detailed example. As you say, this seems to be three things at once: a bug (item 2), a potential enhancement (button), and a problem with the source (I think <collapsible-content> is not a valid HTML tag). I need to check what can reasonably be done here.
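
Until this is resolved, one possible workaround on the source side is to rename the problematic tags before extraction. The sketch below is an assumption-laden example, not a confirmed fix: the function name normalize_faq_markup and the tag mapping are made up for illustration, and it has not been verified that trafilatura then returns the desired output.

```python
import re


def normalize_faq_markup(html: str) -> str:
    """Hypothetical pre-pass: rename <button> to <p> and
    <collapsible-content> to <div>, leaving all attributes untouched."""
    replacements = {"button": "p", "collapsible-content": "div"}
    for old, new in replacements.items():
        # Match the start of an opening or closing tag; the lookahead makes
        # sure the tag name ends here (whitespace, '>' or '/').
        pattern = re.compile(r"<(/?)" + re.escape(old) + r"(?=[\s>/])",
                             re.IGNORECASE)
        html = pattern.sub(lambda m: "<" + m.group(1) + new, html)
    return html


print(normalize_faq_markup(
    '<button itemprop="name">1. Question</button>'
    '<collapsible-content open="">1. Answer</collapsible-content>'
))
```

A regex-based rename is crude (it would also touch matching strings inside scripts or comments), so it only suits small, known inputs like the faq.html example above.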
