GetHTTPPage can't handle redirects?



KeepBotting
10-27-2014, 10:20 PM
(I guess this is the correct section for this.)

Mmkay, my problem lies in the way GetHTTPPage ... uh ... gets an HTTP page.

For whatever reason, it can't handle a URL that redirects.

For instance, the normal URL for my powerbot profile page is https://www.powerbot.org/community/user/446368-keepbotting/

The way IP.Board's forum software works, you can remove the text in the last parameter of the URL, like so: https://www.powerbot.org/community/user/446368-/
Notice the text "keepbotting" is gone from the last portion of the URL.

When I enter the incomplete link into a normal browser, it automatically redirects to the complete one.

GetHTTPPage apparently can't compensate for this. It returns the HTML of the redirect page instead (which is of no use to me).

Any way I can get around this?
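
For reference, this is roughly the call I'm making (a minimal sketch; I'm using plain http here, and the comment just describes what comes back):

program new;
var
  c: Integer;
begin
  c := InitializeHTTPClient(True);
  // This prints the HTML of the redirect page, not the profile page itself.
  writeln(GetHTTPPage(c, 'http://www.powerbot.org/community/user/446368-/'));
  FreeHTTPClient(c);
end.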

Ian
10-27-2014, 10:24 PM
I don't think there is, but if https://www.powerbot.org/community/user/446368-keepbotting/ is the direct link why not link to there?

KeepBotting
10-27-2014, 10:31 PM
I don't think there is, but if https://www.powerbot.org/community/user/446368-keepbotting/ is the direct link why not link to there?

That wouldn't normally be an issue. That's the way I'd do it in practice.
I'm just screwing around with crawling web pages in Simba.

I decided to try and make a script that would pull data off powerbot members' profile pages and organize it.
I figured the easiest way to do that would be to take the base URL for profiles (https://www.powerbot.org/community/user-XXXXXX/) and let the script fill in the member IDs. This would work because member IDs are assigned sequentially.

Since GetHTTPPage can't follow redirect links, I either need a way around that, or a way to guess the username of each member ID.
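
Something along these lines is what I had in mind (a rough sketch; the ID range is arbitrary and the URL pattern follows the redirecting form above):

program new;
var
  c, id: Integer;
  html: String;
begin
  c := InitializeHTTPClient(True);
  // Member IDs are sequential, so just count through a range of them.
  for id := 1 to 10 do
  begin
    html := GetHTTPPage(c, 'http://www.powerbot.org/community/user/' + IntToStr(id) + '-/');
    writeln('ID ' + IntToStr(id) + ': ' + IntToStr(Length(html)) + ' bytes');
  end;
  FreeHTTPClient(c);
end.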

Ian
10-27-2014, 10:56 PM
That wouldn't normally be an issue. That's the way I'd do it in practice.
I'm just screwing around with crawling web pages in Simba.

I decided to try and make a script that would pull data off powerbot members' profile pages and organize it.
I figured the easiest way to do that would be to take the base URL for profiles (https://www.powerbot.org/community/user-XXXXXX/) and let the script fill in the member IDs. This would work because member IDs are assigned sequentially.

Since GetHTTPPage can't follow redirect links, I either need a way around that, or a way to guess the username of each member ID.

Hm, I see...

I just tried and it looks like https isn't supported by getpage. Returns blank unless I remove the s.

Unfortunately getpage returns blank on a link that gets redirected (for powerbot); some other sites at least show some information about where you're being redirected to.

Edit: You could also try using APPA, which should handle the redirects fine. It will be slower though.
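
A quick way to see both issues (a rough sketch; only printing lengths so the output stays readable):

program new;
var
  c: Integer;
begin
  c := InitializeHTTPClient(True);
  // https: comes back as an empty string
  writeln(Length(GetHTTPPage(c, 'https://www.powerbot.org/community/user/446368-keepbotting/')));
  // http: returns the page (or the redirect page, for the shortened URL)
  writeln(Length(GetHTTPPage(c, 'http://www.powerbot.org/community/user/446368-keepbotting/')));
  FreeHTTPClient(c);
end.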

Chris
10-28-2014, 10:51 AM
If the HTML of the redirect page contains the URL it redirects to, you could extract it and use it in a new getPage() call.
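
Something like this, for example, assuming the redirect page embeds an absolute URL somewhere in its HTML (only a sketch; the regex is a guess at what that page actually contains):

function FollowHTMLRedirect(client: Integer; page: String): String;
var
  re: TRegExpr;
begin
  // Fetch the redirect page, pull the first absolute URL out of its HTML,
  // then fetch that URL instead. The pattern will likely need tuning.
  Result := GetHTTPPage(client, page);
  re.Init();
  re.setExpression('https?://[^"'' <>]+');
  if re.Exec(Result) then
    Result := GetHTTPPage(client, re.getMatch(0));
end;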

bg5
10-29-2014, 08:34 PM
GetHTTPPage is not supposed to redirect, because it just gets a page. You can't compare it to a browser, which has heavy libraries to handle all the HTTP request headers.
But you can do...

program new;

// Wrapper around GetHTTPPage: fetch the page, then look through the raw
// response headers for a "Location:" header and, if one is found, fetch
// that URL instead.
function GetHTTPPage2(client: Integer; page: String): String;
var
  re: TRegExpr;
  header: String;
  pos: Integer;
begin
  Result := GetHTTPPage(client, page);
  header := GetRawHeaders(client);

  re.Init();
  re.setExpression('Location: ');
  if re.Exec(header) then
  begin
    // Jump to just past "Location: " and grab everything up to the next whitespace.
    pos := re.getMatchPos(0) + length(re.getExpression);
    re.setExpression('\S+');
    if re.ExecPos(pos) then
    begin
      writeln(re.getMatch(0));
      Result := GetHTTPPage(client, re.getMatch(0));
    end else
      writeln('GetHTTPPage2: Spaces in url ?!');
  end;
end;

var
  c: Integer;
begin
  c := InitializeHTTPClient(true);
  writeln(GetHTTPPage2(c, 'https://www.powerbot.org/community/user/446368-/'));
end.

My Simba doesn't work with https, so I'm getting a blank page anyway.

Ian
10-29-2014, 10:52 PM
GetHTTPPage is not supposed to redirect, because it just gets a page. You can't compare it to a browser, which has heavy libraries to handle all the HTTP request headers.
But you can do...

-snip-

My Simba doesn't work with https, so I'm getting blank page anyway.

If you want, you can use plain http; I tested with yours and it works.
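
i.e. something along the lines of:

c := InitializeHTTPClient(true);
writeln(GetHTTPPage2(c, 'http://www.powerbot.org/community/user/446368-/'));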