This tutorial is brought to you by Janeth Ledezma; the original post is here.
A web scraper is a tool that allows us to select and transform websites’ data into a structured database. Here are a few of my favorite use-cases for a web scraper:
As a simple example, we’ll learn to scrape the front page of The Economist to fetch article titles and their respective URLs. From there you can select and aggregate data, perform custom analysis, store it in Airtable or Google Sheets, or share it with your team inside Slack. The possibilities are infinite!
Please remember to respect the policies around web crawlers of any sites you scrape.
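One common place sites publish their crawler policy is a robots.txt file. Here is a minimal Python sketch of checking such a policy before scraping, using only the standard library. The robots.txt content, the example.com URLs, and the "my-crawler" user agent are all made up for illustration; a real site's rules will differ, and its terms of service may add further restrictions.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used instead of a network request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-crawler" is a hypothetical user-agent name.
allowed = parser.can_fetch("my-crawler", "https://example.com/news/first-story")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")

print("news page allowed:", allowed)
print("private page allowed:", blocked)
```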
Now let’s get started!
Follow this link to set up your crawler API on Autocode: autocode.stdlib.com/new/?workflow=crawler%2Fquery%2Fselectors
You will be prompted to sign in or create a FREE account. If you have a Standard Library account, click “Already Registered” and sign in using your Standard Library credentials.
You will be redirected to Autocode’s Maker Mode.
Maker Mode is a workflow builder like Zapier, IFTTT, and other automation tools, with some important differences: Maker Mode generates code, and that code is completely accessible for editing.
Fill in the following settings:
When you input these settings, notice the code generated on the right.
Select the green “Run Code” button to test run your code.
Within seconds you should see a list of titles from the front page of The Economist.
The web scraper makes a simple GET request to a URL, runs a series of queries on the resulting page, and returns the results to you. It uses the cheerio DOM (Document Object Model) processor, which lets us use CSS selectors to grab data from the page. CSS selectors are patterns used to select the element(s) you want to organize.
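To make the "query the page by selector" idea concrete, here is a minimal Python sketch using only the standard library. It is not how the Autocode crawler or cheerio works internally. It only imitates the effect of a class-based selector on a small, invented HTML snippet. The sample markup and the helper names (`ClassTextCollector`, `select_text`) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Invented sample markup; the real front page is larger and may differ.
SAMPLE_HTML = """
<div class="teaser__text">
  <a class="headline-link" href="/news/first-story">First story</a>
</div>
<div class="teaser__text">
  <a class="headline-link" href="/news/second-story">Second story</a>
</div>
"""

class ClassTextCollector(HTMLParser):
    """Collects the text of every <a> element that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.results = []
        self._buffer = None  # None means we are not inside a matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.class_name in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None

def select_text(html, class_name):
    """Rough stand-in for a 'text' query against the selector a.<class_name>."""
    parser = ClassTextCollector(class_name)
    parser.feed(html)
    return parser.results

titles = select_text(SAMPLE_HTML, "headline-link")
print(titles)
```

A real scraper would first download the page with a GET request and then run the same kind of query on the response body.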
Web pages are written in markup languages such as HTML. An HTML element is one component of an HTML document or web page. Elements define the way information is displayed to the human eye in the browser: information such as images, multimedia, text, style sheets, scripts, etc.
If you are wondering how to find the names of the elements that make up a website — allow me to show you!
Fire up Google Chrome and go to The Economist’s URL, https://www.economist.com/. Then right-click on the title of any article and select “Inspect.” This will open Chrome’s developer console. Alternatively, press command (⌘) + option (⌥) + J.
The developer console will open on the right side of your screen.
Select the cursor icon located in the developer console menu, or press command (⌘) + option (⌥) + C.
This will enable element highlighting, so that whenever you hover your cursor over the website you can quickly identify elements in the developer console.
Notice that when you selected the title of a link, a section on the console is also highlighted. The highlighted element has “class” defined as “headline-link.”
And now you know how we queried for the title of a link! 🙌🏼
You might be wondering how to customize this further. First, the resolver object attribute can take one of four values: text, html, attr and map.
To query the titles’ links, we’ll need to set resolver to take an attr value and add an additional attr key with the value href.
We would expect a response that looks like this when running the code:
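As a rough illustration of what an attr resolver does (this is not the crawler's actual output), here is a small Python sketch run against invented markup. The `resolver` and `attr` keys are the ones named in this tutorial. The `selector` key name, its value, the sample HTML, and the helper names are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Invented sample markup for illustration only.
SAMPLE_HTML = """
<a class="headline-link" href="/news/first-story">First story</a>
<a class="headline-link" href="/news/second-story">Second story</a>
"""

# "resolver" and "attr" come from the tutorial text; the "selector" key
# and its value are assumptions.
query = {"selector": "a.headline-link", "resolver": "attr", "attr": "href"}

class AttrCollector(HTMLParser):
    """Collects one attribute from every element matching tag + class."""

    def __init__(self, wanted_tag, wanted_class, wanted_attr):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_class = wanted_class
        self.wanted_attr = wanted_attr
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if (tag == self.wanted_tag
                and self.wanted_class in classes
                and self.wanted_attr in attrs):
            self.results.append(attrs[self.wanted_attr])

def resolve_attr(html, query):
    # Only handles the simple "tag.class" selector form used in this sketch.
    tag, class_name = query["selector"].split(".", 1)
    collector = AttrCollector(tag, class_name, query["attr"])
    collector.feed(html)
    return collector.results

links = resolve_attr(SAMPLE_HTML, query)
print(links)
```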
We can use map to make subqueries (called mapQueries) against a selector to parse data in parallel. For example, if we want to combine the above two queries (get both title and URL simultaneously)...
Input the following settings for your selectorQueries:
This query is looking for any element <div class="teaser__text"> and then running another query against it with mapQueries.
And our result should return titles and links.
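To show the idea behind map, here is a minimal Python sketch that finds each `<div class="teaser__text">` and then runs two subqueries inside it, one for the title text and one for the href. It only imitates the behavior described above on invented markup. The `TeaserMapper` class and the `title`/`url` result keys are hypothetical names, and the sketch assumes one headline link per teaser.

```python
from html.parser import HTMLParser

# Invented sample markup for illustration only.
SAMPLE_HTML = """
<div class="teaser__text">
  <a class="headline-link" href="/news/first-story">First story</a>
</div>
<div class="teaser__text">
  <a class="headline-link" href="/news/second-story">Second story</a>
</div>
"""

class TeaserMapper(HTMLParser):
    """For each teaser <div>, collects the headline link's text and href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._depth = 0       # <div> nesting inside the current teaser; 0 = outside
        self._current = None  # result being built for the current teaser
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._depth > 0:
                self._depth += 1
            elif "teaser__text" in classes:
                self._depth = 1
                self._current = {"title": "", "url": None}
        elif tag == "a" and self._depth > 0 and "headline-link" in classes:
            self._in_link = True
            self._current["url"] = attrs.get("href")

    def handle_data(self, data):
        if self._in_link:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self._current["title"] = self._current["title"].strip()
        elif tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.results.append(self._current)
                self._current = None

mapper = TeaserMapper()
mapper.feed(SAMPLE_HTML)
stories = mapper.results
print(stories)
```

Grouping by the outer element keeps each title paired with its own link, which two separate flat queries would not guarantee.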
In the next tutorial, we will set up a Slack app that uses this crawler.api to query websites using a Slack slash command and posts results in a channel. Stay tuned!
I would love for you to comment here, e-mail me at Janeth [at] stdlib [dot] com, or follow Standard Library on Twitter, @StandardLibrary. Let me know if you’ve built anything exciting that you would like the Standard Library team to feature or share. I’d love to help!
Janeth Ledezma is a Developer Advocate for Standard Library. Follow her journey with Standard Library on Twitter @ms_ledezma.