How can I scrape faster?

16

The job here is to scrape a site's API, from https://xxx.xxx.xxx/xxx/1.json to https://xxx.xxx.xxx/xxx/1417749.json, and write each document as-is to MongoDB. For that I have the following code:

import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min_item = 1  # renamed from min/max so the built-ins are not shadowed
max_item = 1417749
for n in range(min_item, max_item + 1):  # range() excludes its end, so add 1 to include the last id
    response = requests.get("https://xx.xxx.xxx/{}.json".format(str(n)))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()

But it takes a very long time to finish the task. The question is how I can speed up this process.

— Tek Nath

Have you first tried measuring how long it takes to process a single JSON document? Assuming it takes 300 ms per record, processing all of these records sequentially would take about 5 days.
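
As a rough check of that estimate, here is a minimal sketch of the arithmetic. The 300 ms per record is the assumption made in this comment, not a measurement, and `estimated_days` is a name chosen only for illustration.

```python
def estimated_days(record_count, seconds_per_record):
    """Return the sequential run time in days for a given per-record cost."""
    total_seconds = record_count * seconds_per_record
    return total_seconds / 86_400  # 86,400 seconds in one day


# 1,417,749 records at an assumed 300 ms each
print(round(estimated_days(1_417_749, 0.3), 2))  # prints 4.92
```

Timing a handful of real requests first and plugging the measured average into the function gives a more realistic figure than the assumed 300 ms.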

— tuxdna,

5

asyncio is also a solution if you do not want to use multithreading:

import time
import pymongo
import json
import asyncio
from aiohttp import ClientSession


async def get_url(url, session):
    async with session.get(url) as response:
        if response.status == 200:
            return await response.text()


async def create_task(sem, url, session):
    async with sem:
        response = await get_url(url, session)
        if response:
            parsed = json.loads(response)
            n = url.rsplit('/', 1)[1]
            inserted = com.insert_one(parsed)
            write_log.write(str(n) + "\t" + str(inserted) + "\n")
            print(str(n) + "\t" + str(inserted) + "\n")


async def run(minimum, maximum):
    url = 'https://xx.xxx.xxx/{}.json'
    tasks = []
    sem = asyncio.Semaphore(1000)   # Limit the number of concurrent requests to 1000, to stay below the max open sockets allowed
    async with ClientSession() as session:
        for n in range(minimum, maximum):
            task = asyncio.ensure_future(create_task(sem, url.format(n), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses


client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min_item = 1
max_item = 100

asyncio.run(run(min_item, max_item))  # creates the event loop, runs run() to completion, then closes the loop
write_log.close()

— Frans

1

Using async turned out to be faster than multithreading.

— Tek Nath,

Thanks for the feedback. Interesting result.

— Frans,

10

There are several things you could do:

Reuse the connection. According to the benchmark below, this is about 3 times faster.
Scrape in parallel, for example with several threads or processes.

Parallel code from here:

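import sys
from queue import Queue  # Python 3 name of the Python 2 `Queue` module
from threading import Thread
import requests
concurrent = 200  # number of worker threads; assumed value, not given in this excerpt
q = Queue(concurrent * 2)

def doWork():  # assumed worker; its body is not shown in this excerpt
    while True:
        url = q.get()
        try:
            print(requests.get(url).status_code, url)
        except requests.RequestException as e:
            print("error", url, e)
        finally:
            q.task_done()  # always mark the item done so q.join() can return
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)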

Timings from this question for a reusable connection:

>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953
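
A minimal sketch of how connection reuse might look in the question's loop, assuming the endpoint answers with status 200 and a JSON body. `scrape_range` is a hypothetical helper, not part of `requests`; it accepts any object with a `get` method, so a `requests.Session` can be passed in.

```python
import json


def scrape_range(session, url_template, first, last):
    """Yield (n, document) for every id in first..last that returns HTTP 200.

    All requests go through the same session object, so the underlying
    connection can be kept open and reused between requests.
    """
    for n in range(first, last + 1):
        response = session.get(url_template.format(n))
        if response.status_code == 200:
            yield n, json.loads(response.text)


# Usage with the real library (requires the third-party `requests` package):
#
#   import requests
#   with requests.Session() as session:
#       for n, doc in scrape_range(session, "https://xx.xxx.xxx/{}.json", 1, 100):
#           com.insert_one(doc)   # `com` is the collection from the question
```

The helper only changes how the HTTP calls are made; the MongoDB insert and the logging from the question can stay as they are.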

— keiv.fly

6

You can improve your code in two ways:

Use a Session, so that the connection is not re-established on every request and is kept open;
Use parallelism in your code with asyncio.

Take a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

— albestro

2

Can you add some more details?

— Tek Nath,

4

What you are probably looking for is asynchronous scraping. I would recommend creating small batches of URLs, e.g. 5 URLs at a time (try not to overload the website), and scraping each batch asynchronously. If you do not know much about async, search for the asyncio library. I hope this helps :)
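
A minimal sketch of the batching idea, using only the standard library. `make_batches`, `scrape_in_batches` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real HTTP call (for example one made with aiohttp) so that the sketch runs offline.

```python
import asyncio


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def scrape_in_batches(urls, fetch, batch_size=5):
    """Fetch urls batch by batch; requests inside one batch run concurrently."""
    results = []
    for batch in make_batches(urls, batch_size):
        # gather() starts every coroutine in the batch and waits for all of them,
        # so at most batch_size requests are in flight at any time
        results.extend(await asyncio.gather(*(fetch(url) for url in batch)))
    return results


async def fake_fetch(url):
    """Stand-in for a real request: waits briefly, then returns the URL length."""
    await asyncio.sleep(0.01)
    return len(url)


urls = ["https://xx.xxx.xxx/{}.json".format(n) for n in range(1, 13)]
print(asyncio.run(scrape_in_batches(urls, fake_fetch)))
```

A batch only finishes when its slowest request does, so very small batches trade speed for a lighter load on the server.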

— T Piper

1

Can you add some more details?

— Tek Nath,

3

Try chunking the requests and using the MongoDB bulk write operation.

Group the requests (100 requests per group)
Iterate over the groups
Use an asynchronous request model to fetch the data (the URLs in one group)
Update the DB after each group completes (bulk write operation)

This could save a lot of time in two ways:

MongoDB write latency
Synchronous network call latency

But do not increase the number of parallel requests (the chunk size) too much: it increases the network load on the server, and the server might treat it as a DDoS attack.

https://api.mongodb.com/python/current/examples/bulk.html
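
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `fetch_one` is a hypothetical coroutine that returns a parsed document or `None` for a failed request, and `collection` is any object with pymongo's `insert_many` method (for example `com` from the question).

```python
import asyncio


def chunked(ids, size):
    """Yield consecutive slices of ids, each holding at most size items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


async def fetch_group(ids, fetch_one):
    """Fetch one group concurrently and drop the ids that returned nothing."""
    docs = await asyncio.gather(*(fetch_one(n) for n in ids))
    return [doc for doc in docs if doc is not None]


async def scrape_and_store(ids, fetch_one, collection, group_size=100):
    """Fetch ids group by group and store each group with one bulk insert."""
    stored = 0
    for group in chunked(ids, group_size):
        docs = await fetch_group(group, fetch_one)
        if docs:  # insert_many rejects an empty list
            collection.insert_many(docs)  # one round trip per group
            stored += len(docs)
    return stored
```

Note that `insert_many` here is a blocking call, so the event loop pauses while each group is written; with one write per 100 documents that pause is usually small compared with the network time.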

— thuva4

1

Can you help me with the code for grouping the requests and fetching a group?

— Tek Nath

3

Assuming you do not get blocked by the API and there are no rate limits, this code should make the process about 50 times faster (maybe more, because all requests are now sent using the same session).

import json
import threading
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
logs = []

number_of_json_objects = 1417750
number_of_threads = 50

session = requests.session()

def scrap_write_log(session, start, end):
    # Fetch and store every id in [start, end), recording one log line per document.
    for n in range(start, end):
        response = session.get("https://xx.xxx.xxx/{}.json".format(n))
        if response.status_code == 200:
            try:
                # Keep the insert result in a variable so it can be logged and printed.
                inserted = com.insert_one(json.loads(response.text))
                line = str(n) + "\t" + str(inserted) + "\n"
            except Exception:
                line = str(n) + "\t" + "Failed to insert" + "\n"
            logs.append(line)
            print(line)

# Split the id space into one contiguous [start, end) range per thread.
step = number_of_json_objects // number_of_threads
thread_ranges = [[x, x + step] for x in range(0, number_of_json_objects, step)]

threads = [threading.Thread(target=scrap_write_log, args=(session, start, end)) for start, end in thread_ranges]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open("logging.log", "a") as f:
    for line in logs:
        f.write(line)

— Ibrahim Dar

2

I had the same question many years ago. I was never satisfied with the Python-based answers, which are either quite slow or too complicated. After switching to other, mature tools, the speed is good and I have never gone back.

These days I use the following steps to speed up the process:

generate the list of URLs in a txt file
use aria2c -x16 -d ~/Downloads -i /path/to/urls.txt to download these files
parse them locally
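
The first and last steps can be sketched in Python roughly like this. `write_url_list` and `load_downloaded` are hypothetical helper names, and the sketch assumes that each downloaded file keeps its `.json` extension and holds one JSON document.

```python
import json
from pathlib import Path


def write_url_list(path, url_template, first, last):
    """Write one URL per line for ids first..last (inclusive).

    The resulting file is what gets passed to aria2c with -i.
    """
    with open(path, "w") as f:
        for n in range(first, last + 1):
            f.write(url_template.format(n) + "\n")


def load_downloaded(directory):
    """Yield (filename, parsed JSON) for every *.json file in the directory."""
    for file in sorted(Path(directory).glob("*.json")):
        with open(file) as f:
            yield file.name, json.load(f)
```

Once the files are on disk, the parsed documents can be written to MongoDB in a separate pass, without any network latency in the loop.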

This is the fastest process I have come up with so far.

As for scraping web pages, I likewise download the necessary *.html files first instead of visiting the pages one at a time, which actually makes no difference: when you visit a page with Python tools such as requests, scrapy or urllib, they still cache and download the whole web content for you.

— anonimo

1

First of all, create a list of all the links; they are all the same except for the number.

import threading

# The class is defined before the loop that uses it.
class Demo:
    def __init__(self, url):
        self.json_url = url

    def get_json(self):
        try:
            pass  # your logic: fetch self.json_url and store the result
        except Exception as e:
            print(e)

list_of_links = []
for i in range(1, 1417749):
    list_of_links.append("https://xx.xxx.xxx/{}.json".format(str(i)))

t_no = 2  # number of threads started per batch
for i in range(0, len(list_of_links), t_no):
    all_t = []
    batch_links = list_of_links[i:i + t_no]
    for link in batch_links:
        obj_new = Demo(link)
        t = threading.Thread(target=obj_new.get_json)
        t.start()
        all_t.append(t)
    # Wait for the whole batch before starting the next one.
    for t in all_t:
        t.join()

Simply by increasing or decreasing t_no you can change the number of threads.
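
As a possible variation (a different technique from the batch-and-join loop above, named here plainly), the standard library's `ThreadPoolExecutor` keeps `t_no` threads busy continuously instead of waiting for the slowest thread of each batch. `fetch_all` and `demo_get_json` are hypothetical names used only for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, get_json, t_no=2):
    """Run get_json(link) for every link using at most t_no worker threads."""
    with ThreadPoolExecutor(max_workers=t_no) as pool:
        # map() returns the results in the same order as links
        return list(pool.map(get_json, links))


def demo_get_json(link):
    """Stand-in for the real download logic; returns the length of the link."""
    return len(link)


links = ["https://xx.xxx.xxx/{}.json".format(i) for i in range(1, 6)]
print(fetch_all(links, demo_get_json, t_no=2))
```

As with the original loop, changing `t_no` changes the number of threads.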

— mobin alhassan