On Fri, Jul 31, 2009 at 11:34 AM, Gary Baluha <gumby3203 at gmail.com> wrote:
On Fri, Jul 31, 2009 at 4:57 AM, Ralph Mitchell <ralphmitchell at gmail.com> wrote:
I could really have used something like your feature request about 6 years ago. Instead I spent a lot of time handcrafting bash scripts to log in to web pages.
Yep, that's kind of how URLPlus got started in the first place ;-)
Don't get me started on the sites that hit you with 5 different types of redirects before reaching the front page, or the sites where each input field is held in its own personal form, and the submit button executes javascript to copy the values into a form full of hidden fields for the actual submittal.
The redirect issue actually isn't too difficult to work around. I have been working on a Perl program that handles more in-depth session management than URLPlus currently can, and the solution I'm using now seems to work pretty well. My goal is to eventually convert URLPlus from its command-line curl solution to my current one. This new method deals with multi-page redirects better.
It's not so much the multi-page redirects using the standard "302: page is now elsewhere" format, as the other weird ways redirects are sometimes done. The one that irritated me the most did all of these, in no particular order:
meta-refresh with zero time delay and a new url
self-submitting form - i.e. a preloaded form with "form.submit();" at the end of the html, between script tags
self-submitting form - another preloaded form, but with "onLoad=form.submit();" in the html BODY tag
in script tags, change the page location via: top.location="newurl"
as above, but use "top.href", or "page.href" or something similar.
I'm not knocking your efforts - you've already done more than I ever did towards a generic webpage check. I just think that the above are going to be tricky to handle in an automated way without replicating a large fraction of a web browser. But, now at least they're documented in the mailing list for anyone interested in doing their own web checks... :)
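For anyone who wants to experiment, here is a rough Python sketch of how the redirect styles listed above might at least be detected in a downloaded page. It is not how URLPlus or my old bash scripts work. The names `RedirectSniffer` and `sniff` and the regular expressions are made up for illustration, it only recognises the `location` / `location.href` assignments (not every variant a site might use), and it does not execute any javascript.

```python
import re
from html.parser import HTMLParser


class RedirectSniffer(HTMLParser):
    """Collects hints of non-HTTP redirects found in one HTML page."""

    # Assumed patterns for illustration; real pages vary far more than this.
    LOCATION_RE = re.compile(
        r"""\b(?:(?:top|window|document|self|parent)\.)?location(?:\.href)?"""
        r"""\s*=\s*["']([^"']+)["']"""
    )
    SUBMIT_RE = re.compile(r"\.submit\s*\(\s*\)")
    REFRESH_URL_RE = re.compile(r"url\s*=\s*['\"]?([^'\"]+)", re.I)

    def __init__(self):
        super().__init__()
        self.findings = []              # list of (kind, detail) tuples
        self._in_script = False
        self._last_form_action = None   # action of the most recent <form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)             # attribute names arrive lowercased
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content typically looks like "0; url=http://example.com/next"
            m = self.REFRESH_URL_RE.search(attrs.get("content") or "")
            if m:
                self.findings.append(("meta-refresh", m.group(1).strip()))
        elif tag == "form":
            self._last_form_action = attrs.get("action")
        elif tag == "body":
            onload = attrs.get("onload") or ""
            if self.SUBMIT_RE.search(onload):
                self.findings.append(("onload-submit", onload))
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script:
            return
        for m in self.LOCATION_RE.finditer(data):
            self.findings.append(("js-location", m.group(1)))
        if self.SUBMIT_RE.search(data):
            # Report the action of the last form seen before this script.
            self.findings.append(("script-submit", self._last_form_action))


def sniff(html):
    """Return the list of (kind, detail) redirect hints found in html."""
    parser = RedirectSniffer()
    parser.feed(html)
    parser.close()
    return parser.findings
```

Detection is the easy half. Actually following a self-submitting form still means collecting its hidden fields and POSTing them yourself, and anything that builds the target in javascript will slip past pattern matching like this.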
As for the javascript part, that is a bit more difficult.
Especially when the page you just downloaded creates the form POST url on-the-fly from some of the form elements filled in by the user. Yep, saw that happen too... Another weird page ran a javascript function to generate a random character string to include in the url - luckily the function wasn't too hard to extract and shove through the spidermonkey javascript interpreter... :)
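The extraction step can be roughed out in a few lines of Python. This is only an illustration, not the script I actually used: `extract_js_function` is a hypothetical helper, and it assumes a plain `function name(...) { ... }` declaration with no stray braces inside strings, comments or regex literals.

```python
import re


def extract_js_function(source, name):
    """Return the text of `function name(...) {...}` from source, or None."""
    m = re.search(r"\bfunction\s+" + re.escape(name) + r"\s*\(", source)
    if m is None:
        return None
    open_brace = source.find("{", m.end())
    if open_brace == -1:
        return None
    depth = 0
    # Walk forward counting braces until the function body closes.
    for i in range(open_brace, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return source[m.start():i + 1]
    return None  # unbalanced braces: give up rather than guess
```

The returned text can then be saved to a file, followed by a line that calls the function and prints the result, and handed to whatever standalone javascript interpreter is available.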
Ralph Mitchell