
Scraping website using Celery


Currently, my stack is Flask, Redis, RabbitMQ, and Celery. For the scraping I am using requests and BeautifulSoup. My Flask app runs on Apache with WSGI in production, with app.run(threaded=True).

I have 25 APIs. 10 of them scrape the URL itself (headers, etc.), and the rest call a third-party API for that URL.

I am using a Celery chord to process these APIs, fetching the data with requests.

For my chord header I have 3 workers, while my callback has only 1. I am hitting a bottleneck: ConnectTimeoutError and MaxRetryError. The threads I have read say to set a timeout for every request, because getting these errors means you are overloading the remote server.
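If a per-request timeout is the fix, I assume it would look something like this inside each scraping task (the fetch name and the 5/10-second values are placeholders, not my real settings):

    import requests
    from requests.exceptions import ConnectTimeout

    def fetch(url):
        try:
            # timeout=(connect, read): fail fast if the remote server
            # does not accept the connection or respond in time.
            return requests.get(url, timeout=(5, 10))
        except ConnectTimeout:
            # The remote server never accepted the connection.
            return None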

The problem is that, since I am using a chord, a time.sleep() makes no sense: all 25 API calls run at the same time. Has anyone encountered this? Or am I doing this wrong?
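One alternative I am wondering about instead of sleeping: Celery tasks accept a rate_limit option, which throttles how fast each worker consumes a task. A sketch, assuming a tasks.py with my broker URL (and noting the limit is enforced per worker, not globally):

    from celery import Celery

    app = Celery('tasks', broker='amqp://localhost')

    # Each worker starts at most 10 api_request tasks per minute,
    # spreading out the 25 calls instead of firing them all at once.
    @app.task(rate_limit='10/m')
    def api_request(key):
        ...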

The threads I read seem to suggest switching from requests to pycurl, or using Scrapy. But I don't think that is the issue, since ConnectTimeoutError is about my host overloading a specific URL's server.

My chord process:

    callback = create_document.s(url, company_logo, api_list)
    header = [api_request.s(key) for key in api_list.keys()]
    result = chord(header)(callback)
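(As I understand it, the chord runs every task in the header in parallel and fires the callback only after all of them finish, which is why all 25 calls hit the network at once.)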

Inside the api_request task, requests is used.
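For reference, a stripped-down sketch of what api_request could look like with the timeout plus Celery's autoretry_for, so a transient ConnectTimeoutError is retried with backoff instead of failing the whole chord (API_ENDPOINTS and all the values here are illustrative, not my actual code):

    import requests
    from celery import Celery
    from requests.exceptions import ConnectionError, Timeout

    app = Celery('tasks', broker='amqp://localhost')

    # Placeholder mapping of API keys to URLs.
    API_ENDPOINTS = {'headers': 'https://example.com/api/headers'}

    @app.task(
        autoretry_for=(ConnectionError, Timeout),  # retry transient network errors
        retry_backoff=True,                        # exponential delay between retries
        retry_kwargs={'max_retries': 3},
    )
    def api_request(key):
        response = requests.get(API_ENDPOINTS[key], timeout=(5, 15))
        response.raise_for_status()
        return response.json()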

