Python requests: response asks to enable cookies/JavaScript

tintown  Views: 11  2024-12-31 21:38:35  Comments: 0

I am trying to download an Excel file from a specific website. On my local machine it works perfectly:

>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls') 
>>> r.status_code 
200 
>>> r.content 
b'\xd0\xcf\x11\xe0\xa1\xb1...\x00\x00' # Long binary string 

But when I run the same request from a remote Ubuntu server, I get back a page telling me to enable cookies/JavaScript:

>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls') 
>>> r.status_code 
200 
>>> r.content 
b'<HTML>\n<head>\n<script>\nChallenge=141020;\nChallengeId=120854618;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n</script>\n<script>\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n</script>\n<script>\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.<BR>");\n\n}\nelse\n{\n\tclient.onreadystatechange  = function()\n\t{\n\t\tif(client.readyState  == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n</script>\n</head>\n<body>\n<noscript>JavaScript must be enabled in order to 
view this page.</noscript>\n</body>\n</HTML>' 

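Locally I am running on macOS with Chrome installed (I am not actively using it from the script, but maybe it is relevant?). The remote machine is an Ubuntu server on DigitalOcean with no GUI browser installed.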

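Answer:

The behavior of requests has nothing to do with the browsers installed on the system. It does not depend on them or interact with them in any way.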

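The problem is that the resource you are requesting sits behind some kind of "bot mitigation" mechanism meant to prevent exactly this sort of access. Instead of the file, the server returns a page of JavaScript containing a `Challenge` value and a `ChallengeId`. The script computes an answer from the challenge, sends it back in the `X-AA-Challenge`, `X-AA-Challenge-ID` and `X-AA-Challenge-Result` request headers to "prove" that you are not a bot, and then reloads the page once the server has set a cookie.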
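
As a minimal sketch of the first step, the two values can be pulled out of the challenge page with regular expressions. `SAMPLE_PAGE` is an abbreviated copy of the response shown in the question, and `extract_challenge` is a hypothetical helper name, not part of requests; the patterns assume the server keeps emitting `Challenge=<digits>;` and `ChallengeId=<digits>;` exactly as in that response.

```python
import re

# Abbreviated copy of the challenge page from the question
# (only the first <script> block is reproduced here).
SAMPLE_PAGE = (
    '<HTML>\n<head>\n<script>\n'
    'Challenge=141020;\n'
    'ChallengeId=120854618;\n'
    'GenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n'
    '</script>\n</head>\n</HTML>'
)


def extract_challenge(page):
    """Hypothetical helper: return (challenge, challenge_id) as strings,
    or None if the page does not look like a challenge page."""
    challenge = re.search(r'\bChallenge=(\d+);', page)
    challenge_id = re.search(r'\bChallengeId=(\d+);', page)
    if challenge is None or challenge_id is None:
        return None
    return challenge.group(1), challenge_id.group(1)


print(extract_challenge(SAMPLE_PAGE))          # ('141020', '120854618')
print(extract_challenge('no challenge here'))  # None
```

Compared with splitting on `<script>` and newlines, the regular expressions do not depend on the exact line layout of the page, which may make the parsing a little less brittle if the markup shifts.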

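Fortunately, this particular mitigation mechanism appears to have been solved before, and I was able to get the request working quickly by reusing the challenge-solving function from that code: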

from math import cos, pi, floor 
 
import requests 
 
URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls' 
 
 
def parse_challenge(page): 
    """ 
    Parse a challenge given by mmi and mavat's web servers, forcing us to solve 
    some math stuff and send the result as a header to actually get the page. 
    This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb 
    """ 
    top = page.split('<script>')[1].split('\n') 
    challenge = top[1].split(';')[0].split('=')[1] 
    challenge_id = top[2].split(';')[0].split('=')[1] 
    return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)} 
 
 
def get_challenge_answer(challenge): 
    """ 
    Solve the math part of the challenge and get the result 
    """ 
    arr = list(challenge) 
    last_digit = int(arr[-1]) 
    arr.sort() 
    min_digit = int(arr[0]) 
    subvar1 = (2 * int(arr[2])) + int(arr[1]) 
    subvar2 = str(2 * int(arr[2])) + arr[1] 
    power = ((int(arr[0]) * 1) + 2) ** int(arr[1]) 
    x = (int(challenge) * 3 + subvar1) 
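    # The page's JavaScript uses subvar2 here; subvar1 has the same parity,
    # so cos(pi * n) evaluates to the same +1/-1 either way.
    y = cos(pi * subvar1) 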
    answer = x * y 
    answer -= power 
    answer += (min_digit - last_digit) 
    answer = str(int(floor(answer))) + subvar2 
    return answer 
 
 
def main(): 
    s = requests.Session() 
    r = s.get(URL) 
 
    if 'X-AA-Challenge' in r.text: 
        challenge = parse_challenge(r.text) 
        r = s.get(URL, headers={ 
            'X-AA-Challenge': challenge['challenge'], 
            'X-AA-Challenge-ID': challenge['challenge_id'], 
            'X-AA-Challenge-Result': challenge['challenge_result'] 
        }) 
 
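        # The Session already keeps the cookie set by the previous response;
        # passing it again explicitly is redundant but harmless.
        yum = r.cookies 
        r = s.get(URL, cookies=yum) 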
 
    print(r.content) 
 
 
if __name__ == '__main__': 
    main() 
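
Rather than printing the bytes, it may be worth checking that the final response really is the spreadsheet and not another challenge page before writing it to disk. The sketch below relies on the fact that legacy .xls workbooks are OLE2 compound files, which start with the 8-byte signature whose first six bytes are visible in the local output from the question. `looks_like_xls` and `save_if_xls` are hypothetical helper names introduced for illustration.

```python
from pathlib import Path

# First 8 bytes of an OLE2 compound file, the container format of legacy .xls workbooks.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'


def looks_like_xls(content):
    """Return True if the bytes start with the OLE2 signature."""
    return content.startswith(OLE2_MAGIC)


def save_if_xls(content, path):
    """Write content to path only if it looks like an .xls file.

    Returns True when the file was written, False otherwise
    (for example when the server answered with the HTML challenge page).
    """
    if not looks_like_xls(content):
        return False
    Path(path).write_bytes(content)
    return True
```

In `main()` above, `print(r.content)` could then be swapped for something like `save_if_xls(r.content, 'IWER01_2004.xls')`, with a False return value signalling that the challenge was not accepted.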

