What Is a Python Proxy Server?

This tutorial covers:

What a Python proxy server is and how it works
The steps to build an HTTP proxy server in Python
The pros and cons of this approach

Let's dive in!

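What Is a Python Proxy Server?

A Python proxy server is a Python application that acts as an intermediary between clients and the Internet. It intercepts requests from clients, forwards them to the target servers, and sends the responses back to the clients. This hides the client's identity (its IP address) from the destination server.

This article explains what a proxy server is and how it works.

Python's socket programming capabilities make it easy to implement a basic proxy server that lets you inspect, modify, or redirect network traffic. When it comes to web scraping, proxy servers are useful for caching, improving performance, and strengthening security.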

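How to Implement an HTTP Proxy Server in Python

Follow the steps below to create a Python proxy server script.

Step 1: Initialize Your Python Project

Before getting started, make sure Python 3+ is installed on your machine. If it is not, download the installer, run it, and follow the installation wizard.

Next, use the commands below to create a python-http-proxy-server folder and initialize a Python project with a virtual environment inside it: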

mkdir python-http-proxy-server
cd python-http-proxy-server
python -m venv env

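Open the python-http-proxy-server folder in your Python IDE and create an empty proxy_server.py file.

Great! You now have everything you need to build an HTTP proxy server in Python.

Step 2: Initialize the Incoming Socket

First, you need to create a socket server to accept incoming requests. If you are not familiar with the concept, a socket is a low-level programming abstraction that enables bidirectional data flow between a client and a server. In the context of a web server, a server socket is used to listen for incoming connections from clients.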
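To make the idea of a bidirectional channel concrete, here is a minimal sketch that uses socket.socketpair() from the standard library to create two already-connected sockets in a single process. The names left and right are only illustrative; the proxy in this tutorial uses a listening TCP socket instead.

```python
import socket

# socketpair() returns two connected sockets, so no server setup is needed
left, right = socket.socketpair()

# data flows from left to right...
left.sendall(b"ping")
request = right.recv(1024)

# ...and from right to left over the same connection
right.sendall(b"pong")
reply = left.recv(1024)

left.close()
right.close()

print(request, reply)
```

Both directions use the same pair of sockets, which is exactly what lets a proxy read a request from a client and later write the response back on the same connection.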

Use the following lines to create a socket-based server in Python:

port = 8888
# create a TCP socket and bind the proxy server to a specific address and port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', port))
# start listening, queueing up to 10 pending connections
server.listen(10)

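This initializes the incoming socket server and binds it to the local address 127.0.0.1 on port 8888, so the proxy is reachable at http://127.0.0.1:8888. The listen() call then puts the socket into listening mode so that it can accept connections.

Note: Feel free to change the port number the proxy listens on. You can also modify the script to read that value from the command line for more flexibility.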
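As a sketch of the command-line idea, the standard argparse module can supply the port. The --port flag name and the parse_port() helper are assumptions made for this example and are not part of the tutorial script; the default of 8888 mirrors the value used above.

```python
import argparse

def parse_port(argv=None):
    # --port is a hypothetical flag name; 8888 mirrors the tutorial's default
    parser = argparse.ArgumentParser(description="Simple HTTP proxy server")
    parser.add_argument("--port", type=int, default=8888,
                        help="port the proxy listens on")
    args = parser.parse_args(argv)
    return args.port

# passing an explicit list makes the function easy to try without a real command line
print(parse_port(["--port", "9000"]))
print(parse_port([]))
```

With argv left as None, argparse reads the real command-line arguments, so port = parse_port() could replace the hard-coded value and the script could be started with python proxy_server.py --port 9000.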

socket comes from the Python Standard Library, so all you need is the following import at the top of the script:

import socket

Optionally, log this message to verify that the Python proxy server has started:

print(f"Proxy server listening on port {port}...")

Step 3: Accept Client Requests

When a client connects to the proxy server, accept() returns a new socket dedicated to the communication with that client. Here is how to do that in Python:

# listen for incoming requests
while True:
    client_socket, addr = server.accept()
    print(f"Accepted connection from {addr[0]}:{addr[1]}")
    # create a thread to handle the client request
    client_handler = threading.Thread(target=handle_client_request, args=(client_socket,))
    client_handler.start()

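To handle multiple client requests at the same time, use multithreading as shown above. Remember to import threading from the Python Standard Library:

import threading

As you can see, the proxy server handles each incoming request through a custom handle_client_request() function. The next steps show how to define it.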
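To see the thread-per-connection pattern in isolation, here is a small self-contained sketch. The handle_client() function is a hypothetical stand-in for handle_client_request() that simply echoes one message back in upper case, and the server accepts a fixed number of clients so that the demo terminates.

```python
import socket
import threading

def handle_client(conn):
    # hypothetical stand-in for handle_client_request(): echo one message in upper case
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def serve(server, max_clients):
    # accept a fixed number of clients so the demo terminates
    for _ in range(max_clients):
        conn, _addr = server.accept()
        threading.Thread(target=handle_client, args=(conn,), daemon=True).start()

def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(10)
    port = server.getsockname()[1]
    acceptor = threading.Thread(target=serve, args=(server, 2), daemon=True)
    acceptor.start()
    # both clients are connected before either one sends anything
    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(2)]
    for i, client in enumerate(clients):
        client.sendall(f"client {i}".encode())
    replies = []
    for client in clients:
        replies.append(client.recv(1024).decode())
        client.close()
    acceptor.join()
    server.close()
    return replies

print(demo())
```

Each connection gets its own thread, so a slow client does not hold up the others. The daemon=True flag is a choice made for this sketch so that leftover threads cannot keep the process alive.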

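Step 4: Process the Incoming Request

Once the client socket has been created, use it to:

Read the data from the incoming request.
Extract the target server's host and port from that data.
Use them to forward the client request to the destination server.
Get the response and forward it to the original client.

This section focuses on the first two steps. Define the handle_client_request() function and use it to read the data from the incoming request: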

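def handle_client_request(client_socket):
    print("Received request:\n")
    # read the data sent by the client in the request
    request = b''
    client_socket.setblocking(False)
    while True:
        try:
            # receive the next chunk of data from the client
            data = client_socket.recv(1024)
            if not data:
                # the client closed the connection
                break
            request = request + data
            # log the received chunk
            print(f"{data.decode('utf-8', errors='replace')}")
        except OSError:
            # non-blocking recv() found no data (BlockingIOError) or the socket failed
            break
    # restore blocking mode so that later sendall() calls do not fail midway
    client_socket.setblocking(True)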

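setblocking(False) puts the client socket in non-blocking mode. Then recv() reads the incoming data, which is appended to request as bytes. Since the size of the incoming request is not known in advance, it has to be read one chunk at a time; this example uses chunks of 1024 bytes. In non-blocking mode, recv() raises an exception when it finds no data to read, so the except clause marks the end of the read operation. Note that this simple approach assumes the whole request has already arrived by the time the loop runs.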
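Because a non-blocking read stops as soon as no data is immediately available, it can end before the whole request has arrived. One common alternative, shown here only as a sketch, is to keep the socket in blocking mode and read until the blank line (\r\n\r\n) that ends the HTTP headers. The helper name read_request_headers() and the max_size limit are assumptions made for this example, and the sketch does not read a request body.

```python
import socket

def read_request_headers(sock, chunk_size=1024, max_size=65536):
    # hypothetical helper: read until the blank line that ends the HTTP headers
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(chunk_size)
        if not chunk:
            # the peer closed the connection before finishing the headers
            break
        data += chunk
        if len(data) > max_size:
            raise ValueError("request headers too large")
    return data

# demo with two connected sockets standing in for the client and the proxy
client, proxy_side = socket.socketpair()
client.sendall(b"GET http://example.com/ HTTP/1.1\r\n")
client.sendall(b"Host: example.com\r\n\r\n")
request = read_request_headers(proxy_side)
client.close()
proxy_side.close()
print(request.decode("utf-8"))
```

The loop ends on a protocol-level marker rather than on timing, which makes the read independent of how quickly the client delivers its data.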

Pay attention to the logged messages to follow what the Python proxy server is doing.

After retrieving the incoming request, extract the destination server's host and port from it:

host, port = extract_host_port_from_request(request)

In particular, this is what the extract_host_port_from_request() function looks like:

def extract_host_port_from_request(request):
    # get the value after the "Host:" string
    host_string_start = request.find(b'Host: ') + len(b'Host: ')
    host_string_end = request.find(b'\r\n', host_string_start)
    host_string = request[host_string_start:host_string_end].decode('utf-8')
    # ignore anything after a "/" (if present)
    webserver_pos = host_string.find("/")
    if webserver_pos == -1:
        webserver_pos = len(host_string)
    # look for an explicit port
    port_pos = host_string.find(":")
    if port_pos == -1 or webserver_pos < port_pos:
        # no port specified: use the default HTTP port
        port = 80
        host = host_string[:webserver_pos]
    else:
        # extract the specific port from the host string
        port = int(host_string[port_pos + 1:webserver_pos])
        host = host_string[:port_pos]
    return host, port

To better understand what it does, consider the example below. This is what the encoded string of an incoming request usually contains:

GET http://example.com/your-page HTTP/1.1
Host: example.com
User-Agent: curl/8.4.0
Accept: */*
Proxy-Connection: Keep-Alive

extract_host_port_from_request() extracts the web server's host and port from the "Host:" field. In this case, the host is example.com and the port is 80 (because no port is specified).
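As an illustration of the same idea, the sketch below parses the Host header by splitting the request into lines instead of searching for byte offsets. parse_host_header() is a hypothetical name, not part of the tutorial script, and the sketch assumes a plain hostname or IPv4 address (no IPv6 literals).

```python
def parse_host_header(request, default_port=80):
    # hypothetical helper: return (host, port) from the Host header of a raw request
    for line in request.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host, colon, port = value.strip().decode("utf-8").partition(":")
            return host, int(port) if colon else default_port
    raise ValueError("no Host header found")

sample = (
    b"GET http://example.com/your-page HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"User-Agent: curl/8.4.0\r\n"
    b"\r\n"
)
print(parse_host_header(sample))
print(parse_host_header(b"GET / HTTP/1.1\r\nHost: localhost:8080\r\n\r\n"))
```

Matching the header name case-insensitively and raising an error when the header is missing are design choices of this sketch; the tutorial function assumes the exact "Host: " spelling is present.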

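Step 5: Forward the Client Request and Handle the Response

Once you have the target host and port, forward the client request to the destination server. In handle_client_request(), create a new socket and use it to send the request to the destination server: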

# create a socket to connect to the original destination server
destination_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect to the destination server
destination_socket.connect((host, port))
# send the original request
destination_socket.sendall(request)

Then, get ready to receive the server response and propagate it to the original client:

# read the data received from the server
# one chunk at a time and send it to the client
print("Received response:\n")
while True:
    # receive the next chunk of data from the destination server
    data = destination_socket.recv(1024)
    # log the received chunk
    print(f"{data.decode('utf-8', errors='replace')}")
    if len(data) > 0:
        # send back to the client
        client_socket.sendall(data)
    else:
        # no more data to send
        break

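Again, the response size is not known in advance, so it has to be processed one chunk at a time. When recv() returns empty data, the server has closed the connection and there is nothing left to receive, so the loop can end.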
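The chunk-by-chunk forwarding described above can be tried without a real web server. In this sketch, two socket pairs stand in for the server-to-proxy and proxy-to-client connections. The relay() helper and all variable names are assumptions made for the example.

```python
import socket

def relay(source, destination, chunk_size=1024):
    # hypothetical helper: copy bytes until source reports end of stream
    total = 0
    while True:
        data = source.recv(chunk_size)
        if not data:
            # recv() returned b"": the peer closed its side of the connection
            break
        destination.sendall(data)
        total += len(data)
    return total

# two socket pairs stand in for the server->proxy and proxy->client connections
server_side, proxy_in = socket.socketpair()
proxy_out, client_side = socket.socketpair()

body = b"x" * 3000  # larger than one chunk, so several recv() calls are needed
server_side.sendall(body)
server_side.close()  # closing signals the end of the response

forwarded = relay(proxy_in, proxy_out)
proxy_out.close()

# the client reads until it also sees the end of the stream
received = b""
while True:
    chunk = client_side.recv(1024)
    if not chunk:
        break
    received += chunk

proxy_in.close()
client_side.close()
print(forwarded, received == body)
```

The payload is kept small so that it fits in the socket buffers and the single-threaded demo cannot block; a real proxy forwards while the server is still sending.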

Do not forget to close the two sockets you defined in the function:

# close the sockets
destination_socket.close()
client_socket.close()
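Socket objects also work as context managers, so a with statement closes them automatically, even if an exception is raised in between. The sketch below shows the pattern on a pair of local sockets; in the proxy, it would apply to client_socket and destination_socket.

```python
import socket

left, right = socket.socketpair()

# both sockets are closed when the with block exits, even on an exception
with left, right:
    left.sendall(b"done")
    message = right.recv(1024)

# a closed socket reports -1 as its file descriptor
print(message, left.fileno(), right.fileno())
```

This removes the risk of leaking a socket when the connection to the destination server fails halfway through a request.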

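Great! You just created an HTTP proxy server in Python. Time to review the whole code, launch it, and verify that it works as expected.

Step 6: Put It All Together

This is the final code of your Python proxy server script: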

import socket
import threading

def handle_client_request(client_socket):
    print("Received request:\n")
    # read the data sent by the client in the request
    request = b''
    client_socket.setblocking(False)
    while True:
        try:
            # receive the next chunk of data from the client
            data = client_socket.recv(1024)
            if not data:
                # the client closed the connection
                break
            request = request + data
            # log the received chunk
            print(f"{data.decode('utf-8', errors='replace')}")
        except OSError:
            # non-blocking recv() found no data (BlockingIOError) or the socket failed
            break
    # restore blocking mode so that later sendall() calls do not fail midway
    client_socket.setblocking(True)
    # extract the webserver's host and port from the request
    host, port = extract_host_port_from_request(request)
    # create a socket to connect to the original destination server
    destination_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # connect to the destination server
    destination_socket.connect((host, port))
    # send the original request
    destination_socket.sendall(request)
    # read the data received from the server
    # one chunk at a time and send it to the client
    print("Received response:\n")
    while True:
        # receive the next chunk of data from the destination server
        data = destination_socket.recv(1024)
        # log the received chunk
        print(f"{data.decode('utf-8', errors='replace')}")
        if len(data) > 0:
            # send back to the client
            client_socket.sendall(data)
        else:
            # no more data to send
            break
    # close the sockets
    destination_socket.close()
    client_socket.close()

def extract_host_port_from_request(request):
    # get the value after the "Host:" string
    host_string_start = request.find(b'Host: ') + len(b'Host: ')
    host_string_end = request.find(b'\r\n', host_string_start)
    host_string = request[host_string_start:host_string_end].decode('utf-8')
    # ignore anything after a "/" (if present)
    webserver_pos = host_string.find("/")
    if webserver_pos == -1:
        webserver_pos = len(host_string)
    # look for an explicit port
    port_pos = host_string.find(":")
    if port_pos == -1 or webserver_pos < port_pos:
        # no port specified: use the default HTTP port
        port = 80
        host = host_string[:webserver_pos]
    else:
        # extract the specific port from the host string
        port = int(host_string[port_pos + 1:webserver_pos])
        host = host_string[:port_pos]
    return host, port

def start_proxy_server():
    port = 8888
    # create a TCP socket and bind the proxy server to a specific address and port
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', port))
    # start listening, queueing up to 10 pending connections
    server.listen(10)
    print(f"Proxy server listening on port {port}...")
    # listen for incoming requests
    while True:
        client_socket, addr = server.accept()
        print(f"Accepted connection from {addr[0]}:{addr[1]}")
        # create a thread to handle the client request
        client_handler = threading.Thread(target=handle_client_request, args=(client_socket,))
        client_handler.start()

if __name__ == "__main__":
    start_proxy_server()

Launch it with this command:

python proxy_server.py

You should see the following message in the terminal:

Proxy server listening on port 8888...

To verify that the server works, run a proxied request with cURL. For more details, see our guide on using cURL with a proxy.

Open a new terminal and run:

curl --proxy "http://127.0.0.1:8888" "http://httpbin.org/ip"

This sends a GET request to http://httpbin.org/ip through the proxy server at http://127.0.0.1:8888.

You should get a result like this:

{
  "origin": "45.12.80.183"
}

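That is the IP of the proxy server. Why? Because the /ip endpoint of the HTTPBin project returns the IP address the request comes from. When you run the server locally, "origin" is the IP of your local machine.

Note: The Python proxy server built here only works with HTTP destinations. Extending it to handle HTTPS connections is considerably harder, since HTTPS traffic reaches a proxy through the CONNECT method, which this script does not implement.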
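The same check can be run from Python. The sketch below uses urllib.request from the standard library to route plain HTTP requests through the local proxy. The helper names build_proxy_opener() and fetch_via_proxy() are assumptions made for this example, and the call itself is left commented out because it needs proxy_server.py to be running.

```python
import urllib.request

def build_proxy_opener(proxy_url="http://127.0.0.1:8888"):
    # route plain HTTP requests through the given proxy
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_handler)

def fetch_via_proxy(url, proxy_url="http://127.0.0.1:8888", timeout=10):
    # hypothetical helper: return the response body as text
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

# usage, with proxy_server.py running in another terminal:
# print(fetch_via_proxy("http://httpbin.org/ip"))
```

Only the "http" scheme is mapped to the proxy here, which matches the HTTP-only limitation noted above.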

Now, take a look at the logs of the proxy server Python application. They should contain:

Received request:

GET http://httpbin.org/ip HTTP/1.1
Host: httpbin.org
User-Agent: curl/8.4.0
Accept: */*
Proxy-Connection: Keep-Alive

Received response:

HTTP/1.1 200 OK
Date: Thu, 14 Dec 2023 14:02:08 GMT
Content-Type: application/json
Content-Length: 31
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "origin": "45.12.80.183"
}

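This shows that the proxy server received the request in the format specified by the HTTP protocol. It then forwarded the request to the destination server, logged the response data, and sent the response back to the client. How can you tell? The "origin" IP is the same in both outputs.

Congratulations! You just learned how to build an HTTP proxy server in Python.

Pros and Cons of Using a Custom Python Proxy Server

Now that you know how to implement a proxy server in Python, it is time to look at the advantages and limitations of this approach.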

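Pros:

Full control: With a custom Python script like this one, you have complete control over what your proxy server does, with no shady activity or data leaks.
Customization: You can extend the proxy server with useful features such as request logging and caching to improve performance.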
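As one illustration of the caching idea, here is a minimal in-memory cache keyed by the request line. ResponseCache and its methods are hypothetical names, not part of the tutorial script. The sketch ignores HTTP cache headers and is not thread-safe, so a real proxy would need a lock and proper cache validation.

```python
class ResponseCache:
    """Hypothetical in-memory cache keyed by the request line of a raw request."""

    def __init__(self, max_entries=100):
        self.max_entries = max_entries
        self._store = {}

    @staticmethod
    def key_for(request):
        # use the first line of the raw request, e.g. b"GET http://... HTTP/1.1"
        return request.split(b"\r\n", 1)[0]

    def get(self, request):
        # return the cached response bytes, or None on a cache miss
        return self._store.get(self.key_for(request))

    def put(self, request, response):
        key = self.key_for(request)
        if key not in self._store and len(self._store) >= self.max_entries:
            # evict the oldest entry (dicts preserve insertion order)
            self._store.pop(next(iter(self._store)))
        self._store[key] = response

cache = ResponseCache(max_entries=2)
request = b"GET http://httpbin.org/ip HTTP/1.1\r\nHost: httpbin.org\r\n\r\n"
print(cache.get(request))
cache.put(request, b"HTTP/1.1 200 OK\r\n\r\n{}")
print(cache.get(request))
```

In handle_client_request(), a cache hit could be sent straight to the client, skipping the connection to the destination server.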

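Cons:

Infrastructure costs: Setting up a proxy server architecture is not easy and requires significant spending on hardware or VPS services.
Maintainability: You are responsible for maintaining the proxy architecture, especially its scalability and availability. That is a task for experienced system administrators.
Reliability: The main problem with this solution is that the exit IP of the proxy server never changes. As a result, anti-bot technologies can block that IP and prevent the server from reaching its targets. In other words, the proxy will eventually stop working.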

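These limitations and drawbacks make it difficult to use a custom Python proxy server in production. So what should you use instead? A reliable proxy provider such as Bright Data! Create an account, verify your identity, get free proxies, and use them in your favorite programming language. For example, you can integrate a proxy into a Python script with requests.

Our huge proxy network consists of millions of fast, reliable, and secure proxy servers around the world. Find out why we are the best proxy server provider.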

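Conclusion

This guide explained what a proxy server is and how it works in Python. It walked through building one from scratch using sockets. You can now consider yourself a Python proxy expert. The main problem with this approach is that the exit IP of the proxy server is static, so it will eventually get blocked. Bright Data's rotating proxies help you avoid blocks.

Bright Data manages the best proxy servers in the world, serving Fortune 500 companies and over 20,000 customers. It offers different types of proxies: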