GetHTTPPage can't handle redirects?



KeepBotting
10-27-2014, 10:20 PM
(I guess this is the correct section for this.)

Mmkay, my problem lies in the way GetHTTPPage ... uh ... gets an HTTP page.

For whatever reason, it can't handle a URL that redirects.

For instance, the normal URL for my powerbot profile page is https://www.powerbot.org/community/user/446368-keepbotting/

The way IP.Board's forum software works, you can remove the text in the last parameter of the URL, like so: https://www.powerbot.org/community/user/446368-/
Notice the text "keepbotting" is gone from the last portion of the URL.

When I enter the incomplete link into a normal browser, it automatically redirects to the complete one.

GetHTTPPage apparently can't compensate for this. It returns the HTML of the redirect page instead (which is of no use to me).

Any way I can get around this?
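
For reference, this is roughly the call I'm making (a minimal sketch; I'm using plain http here, and the comment just describes what comes back):

program new;
var
  c: Integer;
begin
  c := InitializeHTTPClient(True);
  // This prints the HTML of the redirect page, not the profile page itself.
  writeln(GetHTTPPage(c, 'http://www.powerbot.org/community/user/446368-/'));
  FreeHTTPClient(c);
end.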

Ian
10-27-2014, 10:24 PM
I don't think there is, but if https://www.powerbot.org/community/user/446368-keepbotting/ is the direct link why not link to there?

KeepBotting
10-27-2014, 10:31 PM
I don't think there is, but if https://www.powerbot.org/community/user/446368-keepbotting/ is the direct link why not link to there?

That wouldn't normally be an issue. That's the way I'd do it in practice.
I'm just screwing around with crawling web pages in Simba.

I decided to try and make a script that would pull data off powerbot members' profile pages and organize it.
I figured the easiest way to do that would be to take the base URL for profiles (https://www.powerbot.org/community/user-XXXXXX/) and let the script fill in the member IDs. This would work because member IDs are assigned sequentially.

Since GetHTTPPage can't follow redirect links, I either need a way around that, or a way to guess the username of each member ID.
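
Something along these lines is what I had in mind (a rough sketch; the ID range is arbitrary and the URL pattern follows the redirecting form above):

program new;
var
  c, id: Integer;
  html: String;
begin
  c := InitializeHTTPClient(True);
  // Member IDs are sequential, so just count through a range of them.
  for id := 1 to 10 do
  begin
    html := GetHTTPPage(c, 'http://www.powerbot.org/community/user/' + IntToStr(id) + '-/');
    writeln('ID ' + IntToStr(id) + ': ' + IntToStr(Length(html)) + ' bytes');
  end;
  FreeHTTPClient(c);
end.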

Ian
10-27-2014, 10:56 PM
That wouldn't normally be an issue. That's the way I'd do it in practice.
I'm just screwing around with crawling web pages in Simba.

I decided to try and make a script that would pull data off powerbot members' profile pages and organize it.
I figured the easiest way to do that would be to take the base URL for profiles (https://www.powerbot.org/community/user-XXXXXX/) and let the script fill in the member IDs. This would work because member IDs are assigned sequentially.

Since GetHTTPPage can't follow redirect links, I either need a way around that, or a way to guess the username of each member ID.

Hm, I see...

I just tried and it looks like https isn't supported by getpage. Returns blank unless I remove the s.

Unfortunately getpage returns blank on a link that gets redirected (for powerbot); some other sites at least show some information about where you're being redirected to.

Edit: You could also try using APPA, which should handle the redirects fine. It will be slower though.
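
A quick way to see both issues (a rough sketch; only printing lengths so the output stays readable):

program new;
var
  c: Integer;
begin
  c := InitializeHTTPClient(True);
  // https: comes back as an empty string
  writeln(Length(GetHTTPPage(c, 'https://www.powerbot.org/community/user/446368-keepbotting/')));
  // http: returns the page (or the redirect page, for the shortened URL)
  writeln(Length(GetHTTPPage(c, 'http://www.powerbot.org/community/user/446368-keepbotting/')));
  FreeHTTPClient(c);
end.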

Chris
10-28-2014, 10:51 AM
If the HTML of the redirect page contains the URL it redirects to, you could extract it and use it in a new getPage() call.
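
Something like this, for example, assuming the redirect page embeds an absolute URL somewhere in its HTML (only a sketch; the regex is a guess at what that page actually contains):

function FollowHTMLRedirect(client: Integer; page: String): String;
var
  re: TRegExpr;
begin
  // Fetch the redirect page, pull the first absolute URL out of its HTML,
  // then fetch that URL instead. The pattern will likely need tuning.
  Result := GetHTTPPage(client, page);
  re.Init();
  re.setExpression('https?://[^"'' <>]+');
  if re.Exec(Result) then
    Result := GetHTTPPage(client, re.getMatch(0));
end;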

bg5
10-29-2014, 08:34 PM
GetHTTPPage is not supposed to redirect, because it just gets a page. You can't compare it to a browser, which has heavy libraries to handle all the HTTP request headers.
But you can do...

program new;

// Wrapper around GetHTTPPage: fetch the page, then look through the raw
// response headers for a "Location:" header and, if one is found, fetch
// that URL instead.
function GetHTTPPage2(client: Integer; page: String): String;
var
  re: TRegExpr;
  header: String;
  pos: Integer;
begin
  Result := GetHTTPPage(client, page);
  header := GetRawHeaders(client);

  re.Init();
  re.setExpression('Location: ');
  if re.Exec(header) then
  begin
    // Jump to just past "Location: " and grab everything up to the next whitespace.
    pos := re.getMatchPos(0) + length(re.getExpression);
    re.setExpression('\S+');
    if re.ExecPos(pos) then
    begin
      writeln(re.getMatch(0));
      Result := GetHTTPPage(client, re.getMatch(0));
    end else
      writeln('GetHTTPPage2: Spaces in url ?!');
  end;
end;

var
  c: Integer;
begin
  c := InitializeHTTPClient(true);
  writeln(GetHTTPPage2(c, 'https://www.powerbot.org/community/user/446368-/'));
end.

My Simba doesn't work with https, so I'm getting a blank page anyway.

Ian
10-29-2014, 10:52 PM
GetHTTPPage is not supposed to redirect, because it just gets a page. You can't compare it to a browser, which has heavy libraries to handle all the HTTP request headers.
But you can do...

-snip-

My Simba doesn't work with https, so I'm getting blank page anyway.

If you want, you can use plain http; I tested with yours and it works.
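
i.e. something along the lines of:

c := InitializeHTTPClient(true);
writeln(GetHTTPPage2(c, 'http://www.powerbot.org/community/user/446368-/'));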