1. Confirm that the project can be scraped, and to what extent.
    1. This means running the scraping task and noting which data points can be scraped, whether pagination works correctly, whether any errors occur, and whether any rows appear to be missing (see the validation sketch after this item).
    2. Verify rows from different parts of the website/different pages, as the results may differ.
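    If the tool exports to CSV, a quick pandas pass can surface the issues above. This is a minimal sketch only; the file name and the `page` column are assumptions, not a fixed convention:

    ```python
    import pandas as pd

    # Hypothetical export from the scraping tool.
    df = pd.read_csv("scrape_output.csv")

    # Duplicate rows often mean pagination revisited the same page.
    print("Duplicate rows:", df.duplicated().sum())

    # Blank cells in key columns point to data points the scraper missed.
    print("Missing values per column:")
    print(df.isna().sum())

    # If the tool records the source page, an uneven row count per page
    # can reveal pages where rows were silently dropped.
    if "page" in df.columns:
        print(df.groupby("page").size())
    ```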
  2. If you leave any scraping for the analyst:
    1. Provide precise instructions, including:
      1. What tools to use
      2. Whether they need to launch several scraping runs, and how to divide them
      3. If relevant, how to clean up the data in the spreadsheet (e.g., through split text to columns, removing duplicates, etc.; a cleanup sketch follows this list)
      4. Tip: You may share your scraping template to reduce the scope.
    2. Factor in any time needed for:
      1. Understanding scraping instructions (10-30 minutes, depending on complexity)
      2. Using the tools
      3. Cleaning up the results
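    For analysts comfortable with Python, the spreadsheet cleanup above (split text to columns, removing duplicates) can also be scripted. A minimal sketch, assuming a hypothetical combined column and delimiter (both are placeholders; adjust to the actual data):

    ```python
    import pandas as pd

    df = pd.read_csv("raw_results.csv")  # hypothetical export file

    # Equivalent of the spreadsheet's "split text to columns":
    # "name_and_title" and the " - " delimiter are illustrative only.
    df[["name", "title"]] = df["name_and_title"].str.split(" - ", n=1, expand=True)

    # Remove exact duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates()

    # Trim stray whitespace that scraping tools often leave behind.
    df["name"] = df["name"].str.strip()

    df.to_csv("clean_results.csv", index=False)
    ```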
  3. If any research has to be done manually, spend at least 15 minutes completing a few rows to assess the scope.
    1. As with 1b, complete rows from different parts of the website/different pages, as the results may differ.
    2. Tip: If the project has a large number of rows and manual research is still needed after scraping, do not scope very tightly. We are likely already saving tens or hundreds of hours, and in our experience several issues can delay completion (e.g., website malfunctioning, differences in connection speed, etc.). A worked scope estimate follows this item.
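    To show how the 15-minute sample translates into an estimate, here is a hypothetical calculation (all numbers are illustrative, including the buffer):

    ```python
    # Hypothetical scope estimate from a timed sample.
    sample_rows = 5        # rows completed during the timed sample
    sample_minutes = 15    # time the sample took
    total_rows = 400       # rows left to research manually

    minutes_per_row = sample_minutes / sample_rows        # 3.0
    estimated_hours = total_rows * minutes_per_row / 60   # 20.0
    buffered_hours = estimated_hours * 1.25               # loose buffer, per the tip above
    print(f"Estimate: {estimated_hours:.0f}h; with buffer: {buffered_hours:.0f}h")
    ```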
  4. Keep in mind that scoping scraping tasks will likely take longer than scoping a regular strategy. That is okay: even if scoping ends up producing the entire answer, we can charge the client on the backend, and we will still likely save them a lot of time.
  5. Note the time spent on outlining the project or providing an answer in the project thread in #projects-in-process.
  6. Tip: Don’t hesitate to ask for a second opinion if you aren’t sure how to scope or if you run into any roadblocks. Many variables, such as connection speed and location, can play a crucial role here.
    1. If it looks like the website can be scraped, but you can’t make it work, tag Anna, Saurav, or Syed, depending on who is on shift.
  7. Recommended tools for different types of tasks:
    1. Scraping from multiple links/subpages: Octoparse
    2. Scraping from a single page: Bardeen, Simplescraper
    3. Scraping unstructured data: GPT Data Harvester
    4. If the task seems too complicated for the above (or similar) tools, tag Syed so that he can check whether the request can be scraped with Python (a minimal sketch of what that might involve follows this list).
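    For reference, a Python scrape is typically a short script along these lines. This is a minimal sketch only; the URL pattern and CSS selectors are placeholders, not a real target:

    ```python
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical paginated listing site (placeholder URL).
    BASE_URL = "https://example.com/listings?page={}"

    rows = []
    for page in range(1, 6):  # first five pages as a scoping sample
        response = requests.get(BASE_URL.format(page), timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # "div.listing", "h2", and ".price" are illustrative selectors.
        for item in soup.select("div.listing"):
            rows.append({
                "title": item.select_one("h2").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })

    print(f"Scraped {len(rows)} rows")
    ```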
  8. If the task is very complex and it is difficult to assess the scope of the scraping and/or the manual research, suggest launching the research in waves or starting with three hours to determine the full scope.
  9. Examples of thorough scraping instructions provided by an RM:
    1. Medical Plans Scrape
    2. Google Maps Scrape