Scraping is a well-known data-fetching technique that allows obtaining structured data when a website does not provide a public API for that purpose. For one of our customers at Ensolvers, we had to develop a scraping system that fetches product information from the Amazon website to keep it up to date in a database.
The approach we took for this solution is divided into three steps.
Fetching the data from the Amazon website.
The strategy we adopted to capture the data is to scrape two different sections of Amazon. To obtain the core data required in our use case, we used the following URL patterns:
While this solution requires two calls to obtain the information for a single product, we decided to do it this way to reduce the number of failures when fetching all of this information. Using these URLs, even if a product is out of stock, its name and description will always be available.
Once the request to Amazon has completed successfully, we must parse the HTML received in the response in order to use it. Since the project is developed in Ruby on Rails, we decided to use the Nokogiri gem to parse the HTML.
After that simple procedure we can easily access all the HTML elements by their id or class to get the product title, description, and so on.
We use regular expressions to eliminate blank spaces and unwanted characters, such as bullet glyphs or other non-printable characters, from the name and description. For the price, we also make some replacements so that it can be formatted correctly before being stored in our backend.
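The exact patterns in our scraper differ, but the following sketch shows the kind of normalization described: collapsing whitespace, stripping bullet and invisible characters, and coercing a displayed price into a number (the helper names are illustrative):

```ruby
# Illustrative text/price normalization helpers, not the production patterns.

def clean_text(raw)
  raw.gsub(/[\u2022\u00b7\u200e\u200f]/, "") # drop bullets and invisible marks
     .gsub(/\s+/, " ")                       # collapse runs of whitespace
     .strip
end

def parse_price(raw)
  # "$1,299.99" -> 1299.99; returns nil when no digits are present
  digits = raw.gsub(/[^\d.]/, "")
  digits.empty? ? nil : digits.to_f
end

puts clean_text("\u2022  Example   Widget \u200e")  # => "Example Widget"
puts parse_price("$1,299.99")                       # => 1299.99
```

Note that `parse_price` assumes US-style formatting; locale-specific formats (e.g. "1.299,99") would need a different set of replacements.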
In order to keep the database as consistent and up to date as possible, we adopted a product-update scheme following an LRU (Least Recently Used) strategy. To avoid our servers' IPs getting banned, a request is made to Amazon every 10 minutes for a single product: the one that was updated longest ago. Depending on the number of products, this periodicity can be increased or decreased. For instance, with a database of 30 products and one product updated every 10 minutes, each product is refreshed roughly every 5 hours. Everything depends on the update margin we choose and on how much variability the products we track have.
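The LRU pick itself is trivial. Below is a minimal in-memory sketch; in the real system this would be a database query (for example, ordering products by their last-update timestamp ascending and taking the first row). The `Product` struct and its fields are illustrative:

```ruby
require "time"

# Illustrative product record; in production this state lives in the database.
Product = Struct.new(:asin, :updated_at)

# LRU strategy: the least recently updated product is scraped next.
def next_product_to_update(products)
  products.min_by(&:updated_at)
end

products = [
  Product.new("B000000001", Time.parse("2023-05-01 10:00")),
  Product.new("B000000002", Time.parse("2023-05-01 08:00")),
  Product.new("B000000003", Time.parse("2023-05-01 09:00")),
]

puts next_product_to_update(products).asin  # => "B000000002", the stalest entry
```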
Our final algorithm was implemented as follows: when the job in charge of performing the scraping fails to obtain the information from Amazon, it records an unsuccessful attempt for that product in the database. When a product reaches 3 failed attempts, the job deactivates it in the database, hiding it from the users of our application. The main cause of failure is that the product has no vendor at that moment.
To re-evaluate the status of deactivated products, another job is scheduled every hour to retry scraping one of them. If the scraping succeeds, the product is re-activated in the database and its failed-attempt counter is reset.
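The failure-tracking logic from the last two paragraphs can be sketched as follows. Field and method names are illustrative; in the application this state is persisted in the database rather than kept in memory:

```ruby
# Sketch of the deactivate-after-3-failures / reactivate-on-success logic.
MAX_FAILED_ATTEMPTS = 3

Product = Struct.new(:asin, :failed_attempts, :active) do
  # Called by the scraping job when Amazon returns no usable data
  def record_scrape_failure
    self.failed_attempts += 1
    self.active = false if failed_attempts >= MAX_FAILED_ATTEMPTS
  end

  # Called by the hourly retry job when scraping succeeds again
  def record_scrape_success
    self.failed_attempts = 0
    self.active = true
  end
end

product = Product.new("B000000001", 0, true)
3.times { product.record_scrape_failure }
puts product.active            # => false: hidden after the third failure

product.record_scrape_success
puts product.active            # => true: re-activated, counter reset
```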
The process for accessing Amazon product information is quite simple. The key to success in this particular case was having a good strategy for failures and possible IP blocks by Amazon, but the other corner cases listed below also have to be considered.
Headers. When sending a GET request to Amazon, it is important to simulate that it is being sent from a web browser; otherwise we will receive an error. With the Faraday gem we did not have to send any specific headers, but with Ruby's native Net::HTTP library we did have to specify them.
For example, according to the Ruby documentation, custom headers can be set directly on the request object.
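A sketch of that setup with Net::HTTP is shown below. The request is built but not sent, so the header handling is visible on its own; the User-Agent string is just one example of a realistic browser UA:

```ruby
require "net/http"
require "uri"

uri = URI("https://www.amazon.com/dp/B000000001")  # example product URL

# Build the GET request and set browser-like headers on it
request = Net::HTTP::Get.new(uri)
request["User-Agent"] =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
request["Accept-Language"] = "en-US,en;q=0.9"

# To actually perform the call:
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
#   http.request(request)
# end

puts request["User-Agent"]
```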
Requests per product and the time between them. In relation to the previous point, note that making a considerable number of simultaneous requests to Amazon can cause a temporary block of that IP. To avoid this, in our case we make 2 requests per product every 10 minutes, staying well below the threshold Amazon establishes to protect its servers against DDoS attacks. If a high number of requests is required because of the number of products that need to be updated, a more complex approach should be implemented, using a cluster of containers that perform the requests and a queue that distributes the requests uniformly among them.
Monitoring system. Besides the scraper itself, it is important to have a monitoring and alerting system to be aware of any failure in the process. In our case we implemented Slack alerts that notify us when products fail to update, so that we can investigate the reason, since products are no longer visible to users after 3 failed attempts to get their information from Amazon.
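A simple way to do this is a Slack incoming webhook. The sketch below builds the alert payload and posts it; the webhook URL is a placeholder (a real one comes from Slack's Incoming Webhooks configuration), and the helper names are ours for illustration:

```ruby
require "json"
require "net/http"
require "uri"

# Placeholder webhook URL; replace with one generated in Slack.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_scrape_alert(asin, attempts)
  { text: "Scraping failed for product #{asin} (#{attempts} consecutive attempts)" }
end

def notify_slack(payload)
  uri = URI(SLACK_WEBHOOK_URL)
  Net::HTTP.post(uri, payload.to_json, "Content-Type" => "application/json")
end

payload = build_scrape_alert("B000000001", 3)
puts payload.to_json
# notify_slack(payload)  # uncomment with a real webhook URL
```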
With these simple considerations, we implemented an effective system that currently runs in our production environment and updates a dynamic list of preferred products depending on stock and availability, and that also lets our customer's support team know when a product is no longer offered.