Pulling wrong title

luapnampahc · January 24, 2020, 4:50pm

Hi, see the feed below. It’s pulling the wrong part of the page. Is there a way I can amend this so that I can select from various heading / title options? Thanks

http://ftr-premium.fivefilters.org/makefulltextfeed.php?use_extracted_title=1&url=createfeed.fivefilters.org%2Fextract.php%3Furl%3Dhttps%3A%2F%2Fhigcapital.com%2Fnews%26in_id_or_class%3Dnews-list%26url_contains%3D%26key%3D2460%26hash%3D518f46dfed1fc6acfef5b5541bbe331a8ea06e1983de3b8af0830af315269304&key=2460&hash=b40060d86dd66ccb71a177dd42114b3de40cf820

fivefilters · January 25, 2020, 12:36am

Hi there, we’ve added rules to improve extraction for this site. Should work for you now.

luapnampahc · February 3, 2020, 10:55am

Thank you. Is there a way that a user can manage their own extraction rules? I am coming up against a lot of issues with the extraction pulling “press releases” as the article title and not the actual title of the article.

Example: http://ftr-premium.fivefilters.org/makefulltextfeed.php?use_extracted_title=1&url=createfeed.fivefilters.org%2Fextract.php%3Furl%3Dhttps%253A%252F%252Fwww.bankofthewest.com%252Fabout-us%252Fpress-center%252Fpress-releases.html%26in_id_or_class%3Dtab-pane%2Bactive%26url_contains%3D%26key%3D2460%26hash%3D18c27a8a488a2bd06b3d47351a6ecf8f1d7d3143cbcbd1b3c0049765b3165709&key=2460&hash=de2dd87bd7689d80f40f25dda5d3835a67ae16d0

Presumably, it’s trying to pull the page title rather than an H1 or H2, for example. Import.io does this rather nicely.

fivefilters · February 4, 2020, 1:56pm

Full-Text RSS prioritises the title in the feed (if input is a feed) or the <title> element of the page if input is a URL of an HTML page.

In most cases the title element will contain the main title. It’s usually in the site’s interest to put the main title here as that’s what Google will list in its search results too. In this case, you’ll see Google treats the article’s title the same way Full-Text RSS does.

But because you’re giving Full-Text RSS a feed which you’ve generated using our Feed Creator tool, you’ve actually got the real title in the Feed Creator output:

http://createfeed.fivefilters.org/extract.php?url=https%3A%2F%2Fwww.bankofthewest.com%2Fabout-us%2Fpress-center%2Fpress-releases.html&in_id_or_class=tab-pane+active&url_contains=&key=2460&hash=18c27a8a488a2bd06b3d47351a6ecf8f1d7d3143cbcbd1b3c0049765b3165709

However, by using the use_extracted_title=1 parameter when passing this to Full-Text RSS, you’re telling Full-Text RSS to ignore the article’s title in the feed and try to extract it from the HTML. If you remove this parameter, you should get the result you want:

http://ftr-premium.fivefilters.org/makefulltextfeed.php?url=createfeed.fivefilters.org%2Fextract.php%3Furl%3Dhttps%253A%252F%252Fwww.bankofthewest.com%252Fabout-us%252Fpress-center%252Fpress-releases.html%26in_id_or_class%3Dtab-pane%2Bactive%26url_contains%3D%26key%3D2460%26hash%3D18c27a8a488a2bd06b3d47351a6ecf8f1d7d3143cbcbd1b3c0049765b3165709&key=2460&hash=de2dd87bd7689d80f40f25dda5d3835a67ae16d0

So the simplest solution here seems to be the above. Just let Full-Text RSS use the title you’ve already got in the feed.
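
If you build these request URLs in a script, dropping the parameter can be sketched roughly as below. This is only an illustration: drop_query_param is a hypothetical helper, and the URL is a shortened stand-in with example.com, KEY and HASH as placeholders rather than real values.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url: str, name: str) -> str:
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # parse_qsl decodes one level of percent-encoding and urlencode re-applies it,
    # so the nested, already-encoded feed URL keeps its inner encoding.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened stand-in for the request URL above; KEY and HASH are placeholders.
original = (
    "http://ftr-premium.fivefilters.org/makefulltextfeed.php"
    "?use_extracted_title=1"
    "&url=createfeed.fivefilters.org%2Fextract.php%3Furl%3Dhttps%253A%252F%252Fexample.com%252Fnews"
    "&key=KEY&hash=HASH"
)

cleaned = drop_query_param(original, "use_extracted_title")
print(cleaned)
```

The remaining parameters keep their original order and encoding, so only use_extracted_title disappears from the request.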

Having said that, to answer your question:

“Is there a way that a user can manage their own extraction rules?”

You can override default extraction rules using the siteconfig parameter. This allows you to use the directives we list here. So let’s say you’re not using an input feed and want to extract the correct press release title from the HTML. The source HTML looks like this:

<h1>Press Release</h1>
<h2><br>
<strong>Bank of the West Named One of the Best 
Places to Work for LGBTQ Equality and Diversity</strong>
<br>
<i>Sustainable Finance Leader Received a Perfect 
Score of 100% on the Human Rights Campaign 
Foundation’s Corporate Equality Index for Second 
Year in a Row<br><br>
Named to Forbes’ 2020 Best Employers for Diversity List</i><br><br>
San Francisco, CA | Jan 15, 2020
<br><br></h2>

What you’re after is the <strong> element inside the <h2> element.

With XPath this would be:

//h2//strong
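
To see what that expression selects, here is a rough sketch using Python’s standard library. It is an illustration only, not how Full-Text RSS evaluates the rule. Assumptions: the markup is a shortened copy of the snippet above with <br> written as <br/> so the XML parser accepts it, extract_title is a hypothetical helper, and the path gets a leading "." because ElementTree only supports a subset of XPath with relative paths.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the press release markup
# (<br> written as <br/> so the stdlib XML parser accepts it).
FRAGMENT = """
<div>
  <h1>Press Release</h1>
  <h2><br/>
    <strong>Bank of the West Named One of the Best
    Places to Work for LGBTQ Equality and Diversity</strong>
    <br/>
    <i>Named to Forbes’ 2020 Best Employers for Diversity List</i>
  </h2>
</div>
"""


def extract_title(markup: str) -> str:
    """Return the text of the first <strong> found anywhere inside an <h2>."""
    root = ET.fromstring(markup)
    # ElementTree paths are relative, hence ".//h2//strong" instead of "//h2//strong".
    node = root.find(".//h2//strong")
    if node is None:
        return ""
    # Collapse the line break and indentation inside the element into single spaces.
    return " ".join("".join(node.itertext()).split())


title = extract_title(FRAGMENT)
print(title)
```

The <h1> text and the <i> subtitle are skipped because neither is a <strong> inside the <h2>.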

To tell Full-Text RSS you want this as the title, you’d use:

title: //h2//strong

Because we’re going to pass this to Full-Text RSS in the query string of the URL, in the siteconfig parameter, we need to make sure it’s URL-encoded:

&siteconfig=title%3A%2F%2Fh2%2F%2Fstrong
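
If you’d rather not encode by hand, the same string can be produced with Python’s standard library, as a minimal sketch. Note the encoded value above has no space after the colon, so that is the form encoded here; with a space, quote() would add %20 and urlencode() would add +.

```python
from urllib.parse import quote, urlencode

# The directive exactly as it appears, encoded, in the query string above.
directive = "title://h2//strong"

# safe="" makes quote() escape "/" too, which it leaves alone by default.
encoded = quote(directive, safe="")
print(encoded)  # title%3A%2F%2Fh2%2F%2Fstrong

# urlencode() builds the whole name=value pair and escapes the value the same way.
pair = urlencode({"siteconfig": directive})
print(pair)  # siteconfig=title%3A%2F%2Fh2%2F%2Fstrong
```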

If we add this to your original URL, you’ll see its effect:

http://ftr-premium.fivefilters.org/makefulltextfeed.php?use_extracted_title=1&siteconfig=title%3A%2F%2Fh2%2F%2Fstrong&url=createfeed.fivefilters.org%2Fextract.php%3Furl%3Dhttps%253A%252F%252Fwww.bankofthewest.com%252Fabout-us%252Fpress-center%252Fpress-releases.html%26in_id_or_class%3Dtab-pane%2Bactive%26url_contains%3D%26key%3D2460%26hash%3D18c27a8a488a2bd06b3d47351a6ecf8f1d7d3143cbcbd1b3c0049765b3165709&key=2460&hash=de2dd87bd7689d80f40f25dda5d3835a67ae16d0

I’ll add that if you find you need to adjust extraction rules a lot like this, you will have more flexibility running our self-hosted copy of Full-Text RSS. You then have access to a site_config/custom/ folder where you can create extraction rules to override the default ones. And you won’t have to worry about URL encoding things to pass in the query string. For example, you could have a file site_config/custom/bankofthewest.com.txt with the contents:

title: //h2//strong

Hope that’s some help.

luapnampahc · February 7, 2020, 4:58pm

Awesome. Just awesome, thanks a lot.