xpath and html

desk-user · December 8, 2013, 8:39pm

Hi,

I am trying to scrape the content of the items in this feed: http://libgen.org/rss/index.php

Everything I have tried with custom configurations has failed, and I am not really sure why. I am running version 3.2. I wonder if it is because there is no XML on the pages of that website. Is there a limitation there?

fivefilters · December 8, 2013, 8:44pm

Hi Toto, that URL does not load for me at all. Does it work for you - accessing it directly in your browser, not through Full-Text RSS?

desk-user · December 9, 2013, 12:30pm

Hi BIS,

Could you provide the URL of your Full-Text RSS installation so that we can try to help you?

adnan

desk-user · December 9, 2013, 7:54pm

http://87.231.63.237/fulltext/

You can delete the url from my message when you see it. Thanks!

BIS

desk-user · December 9, 2013, 8:50pm

Oh and I forgot to answer the question. Yes, it works when I access it with my browser.

BIS

desk-user · December 10, 2013, 9:43am

Well, please note that I am not a support agent; I am just a Full-Text RSS user like you.

Just trying to help.

Now, please send me the patterns you use for this feed and I will try to help you.

Or you can simply try this in the config file:

prune:no
tidy:no
body: //body/table

Test this and let me know the results.

adnan
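
For readers unfamiliar with XPath, the `body: //body/table` line above asks for every `<table>` that is a direct child of `<body>`. The sketch below illustrates that selection on a small, hypothetical page; the sample markup and the `select_body_tables` helper are assumptions made for illustration and are not taken from the feed or from Full-Text RSS. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed input, so it is not how Full-Text RSS itself evaluates the pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an item page; the real markup may differ.
SAMPLE_PAGE = """<html>
  <body>
    <div id="nav">Site navigation</div>
    <table>
      <tr><td>Title</td><td>Example Book</td></tr>
      <tr><td>Summary</td><td>A short description.</td></tr>
    </table>
  </body>
</html>"""


def select_body_tables(markup):
    """Return the whitespace-normalised text of each <table> directly under <body>."""
    root = ET.fromstring(markup)
    # Relative to the <html> root, "./body/table" matches the same nodes as
    # //body/table does for this sample document.
    tables = root.findall("./body/table")
    return [" ".join(" ".join(t.itertext()).split()) for t in tables]


if __name__ == "__main__":
    for text in select_body_tables(SAMPLE_PAGE):
        print(text)
```

Note that a table nested inside another element (for example a `<div>`) is not a direct child of `<body>` and would not be matched by this expression.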

desk-user · December 10, 2013, 6:20pm

Thank you adnan for your answer! I have tried your config and it still gives me weird results. Sometimes it shows me the content of the page (cover, summary and various other details), sometimes just the summary, sometimes nothing at all, and sometimes some articles are fetched differently… I really don’t understand how it works.

BIS

desk-user · December 11, 2013, 6:02am

Without anything to go on (a screenshot, a sample of the weird output, etc.), I can’t help you.

One piece of advice about the weird results: try setting your server encoding to UTF-8.

Another is to try fetching your feed here: http://fivefilters.org/content-only/

I tried it with your feed and it works fairly well.

I hope my advice helps you.

adnan

desk-user · December 12, 2013, 7:52pm

It doesn’t work well on my side. I have just tried “http://fivefilters.org/content-only/” with the feed and I got: [unable to retrieve full-text content]

You can also try with my copy of full-text, refresh the page and see if the content changes or not. When I do it, the content of each article changes when I refresh.

BIS

fivefilters · December 12, 2013, 8:29pm

Hi BIS, Adnan (thanks for the help)

A big problem here is simply that the server for this feed appears to be down a lot, and when it’s not down, it appears to be extremely slow. It is currently loading for me, but so slowly that Full-Text RSS and the Feed Validator both give up waiting for the response to arrive.

You can see a brief history of the downtime and people’s complaints about it here: http://www.isitdownrightnow.com/lib.rus.ec.html

The feed validator gives up and fails to load the feed: http://feedvalidator.org/check.cgi?url=http%3A%2F%2Flibgen.org%2Frss%2Findex.php (“Server returned timed out” - maybe it’ll work when you try it).
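
One way to check from your own machine whether the feed answers within a given time is a small script like the one below. This is only a minimal sketch: the `check_feed` helper, its `opener` parameter, and the 10-second default are assumptions chosen for illustration, and they do not reflect the timeout that Full-Text RSS or the Feed Validator actually use.

```python
import socket
import urllib.error
import urllib.request


def check_feed(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return a short status string saying whether the URL answered in time."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
        return f"ok: received {len(body)} bytes"
    except socket.timeout:
        # Raised when the server accepts the connection but responds too slowly.
        return f"timed out after {timeout} seconds"
    except urllib.error.URLError as exc:
        # urllib may wrap a timeout during connection setup in a URLError.
        if isinstance(exc.reason, socket.timeout):
            return f"timed out after {timeout} seconds"
        return f"failed: {exc.reason}"


if __name__ == "__main__":
    print(check_feed("http://libgen.org/rss/index.php"))
```

If this reports a timeout while the page eventually loads in a browser, the server is probably just too slow for automated tools, which matches what the validator shows.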