Hi, see the feed below. It’s pulling the wrong part of the page. Is there a way I can amend this so that I can select from various heading/title options? Thanks.
Hi there, we’ve added rules to improve extraction for this site. Should work for you now.
Thank you. Is there a way that a user can manage their own extraction rules? I am coming up against a lot of issues with the extraction pulling "Press Release" as the article title and not the actual title of the article.
Presumably, it’s trying to pull the page title rather than an H1 or H2, for example. Import.io does this rather nicely.
Full-Text RSS prioritises the title in the feed (if the input is a feed) or the <title> element of the page (if the input is the URL of an HTML page).
In most cases the title element will contain the main title. It’s usually in the site’s interest to put the main title here as that’s what Google will list in its search results too. In this case, you’ll see Google treats the article’s title the same way Full-Text RSS does:
But because you’re giving Full-Text RSS a feed which you’ve generated using our Feed Creator tool, you’ve actually got the real title in the Feed Creator output:
However, by including the use_extracted_title=1 parameter when passing this to Full-Text RSS, you’re telling Full-Text RSS to ignore the article’s title in the feed and try to extract it from the HTML. If you remove this parameter, you should get the result you want:
So the simplest solution here seems to be the above. Just let Full-Text RSS use the title you’ve already got in the feed.
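As a rough illustration of removing that parameter, here is a minimal Python sketch that strips one query parameter from a request URL. The endpoint and feed URL below are hypothetical placeholders, not the real addresses from this thread, and drop_param is just a helper name made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_param(url: str, name: str) -> str:
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # parse_qsl decodes the query string into (key, value) pairs
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    # urlencode re-encodes the remaining pairs for use in a URL
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical request URL, for illustration only
request_url = (
    "https://example.com/fulltext"
    "?url=https%3A%2F%2Fexample.org%2Ffeed.xml&use_extracted_title=1"
)
cleaned_url = drop_param(request_url, "use_extracted_title")
print(cleaned_url)
```

With the parameter gone, the title already present in the input feed is the one that gets used, as described above.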
Having said that, to answer your question:
Is there a way that a user can manage their own extraction rules.
You can override default extraction rules using the siteconfig parameter. This allows you to use the directives we list here. So let’s say you’re not using an input feed and want to extract the correct press release title from the HTML. The source HTML looks like this:
<h1>Press Release</h1>
<h2><br>
<strong>Bank of the West Named One of the Best
Places to Work for LGBTQ Equality and Diversity</strong>
<br>
<i>Sustainable Finance Leader Received a Perfect
Score of 100% on the Human Rights Campaign
Foundation’s Corporate Equality Index for Second
Year in a Row<br><br>
Named to Forbes’ 2020 Best Employers for Diversity List</i><br><br>
San Francisco, CA | Jan 15, 2020
<br><br></h2>
What you’re after is the <strong> element inside the <h2> element.
With XPath this would be:
//h2//strong
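If you want to check what that expression selects before using it, here is a small Python sketch. It uses the standard library's ElementTree, which only supports a subset of XPath and needs well-formed XML, so the snippet below is a simplified, hand-tidied version of the page's HTML (self-closed <br/> tags, a wrapping <div>, shortened subtitle). It is not how Full-Text RSS itself evaluates the rule.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page's HTML (an assumption for this demo)
snippet = """
<div>
  <h1>Press Release</h1>
  <h2><br/>
    <strong>Bank of the West Named One of the Best
    Places to Work for LGBTQ Equality and Diversity</strong>
    <br/>
    <i>Subtitle text shortened for this example</i>
  </h2>
</div>
"""

root = ET.fromstring(snippet.strip())

# ElementTree needs a leading "." for a search relative to the root element
matches = root.findall(".//h2//strong")

# Collapse the line break and indentation inside the element into single spaces
title = " ".join(matches[0].text.split())
print(title)
```

The <h1> holds only "Press Release", which is why a rule targeting the <strong> inside the <h2> picks up the real headline instead.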
To tell Full-Text RSS you want this as the title, you’d use:
title: //h2//strong
Because we’re going to pass this to Full-Text RSS in the query string of the URL, in the siteconfig parameter, we need to make sure it’s URL-encoded:
&siteconfig=title%3A%2F%2Fh2%2F%2Fstrong
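If you'd rather not encode rules by hand, a minimal Python sketch like the one below produces the same string. It uses urllib.parse.quote with safe="" so that the colon and slashes are percent-encoded too. Note that the encoded value above has no space after the colon; if you keep the space from the rule, it is encoded as %20.

```python
from urllib.parse import quote

# The rule as it appears in the encoded example (no space after the colon)
rule = "title://h2//strong"

# safe="" makes quote() percent-encode "/" and ":" as well
encoded = quote(rule, safe="")
query_fragment = "&siteconfig=" + encoded
print(query_fragment)

# The same rule written with a space after the colon
encoded_with_space = quote("title: //h2//strong", safe="")
print(encoded_with_space)
```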
If we add this to your original URL, you’ll see its effect:
I’ll add that if you find you need to adjust extraction rules like this a lot, you will have more flexibility running our self-hosted copy of Full-Text RSS. You then have access to a site_config/custom/ folder where you can create extraction rules to override the default ones, and you won’t have to worry about URL-encoding things to pass in the query string. For example, you could have a file site_config/custom/bankofthewest.com.txt with the contents:
title: //h2//strong
Hope that’s some help.
Awesome. Just awesome, thanks a lot.