With Dataflow Kit web scraper, extracting data is as easy as clicking the data you need.
This guide is a general reference for common tasks involved in building a data collection.
A collection is a set of instructions describing the actions to perform against a specific website. Dataflow Kit servers then use these instructions to gather data from the target website.
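To make the idea of a collection more concrete, here is a minimal sketch of how such a set of instructions could be modelled as plain data. The class names, field names, URL and CSS selectors are all hypothetical illustrations and are not Dataflow Kit's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Selector:
    name: str            # label of the extracted field, e.g. "title"
    css: str             # CSS selector matching the elements to extract
    kind: str = "text"   # hypothetical selector type: "text", "image" or "link"


@dataclass
class Paginator:
    kind: str                  # "next_link", "infinite_scroll" or "load_more"
    css: Optional[str] = None  # not needed for infinite scroll


@dataclass
class Collection:
    url: str                                         # page the scraper starts from
    selectors: List[Selector] = field(default_factory=list)
    paginator: Optional[Paginator] = None            # None means a single page


# Hypothetical collection for a webshop results page.
collection = Collection(
    url="https://example.com/shop",
    selectors=[
        Selector("image", ".product img", kind="image"),
        Selector("title", ".product .title"),
        Selector("price", ".product .price"),
    ],
    paginator=Paginator("next_link", css="a.next"),
)
```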
The scraping process is based on the pattern of data you have selected. Look at the sample screenshot taken from a webshop results page. Let's say we want to scrape the image, title and price of each listed item.
Load web page.
Type (or copy and paste) a website address into the address bar and click the button next to it to load the page.
The web address should start with `http://` or `https://`.
The requested web page is now loaded in the "point and click" editor.
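The address rule from this step can be checked before pasting a URL into the editor. The following is a small sketch using Python's standard library; the function name `is_loadable_address` is an assumption made for illustration and is not part of Dataflow Kit.

```python
from urllib.parse import urlparse


def is_loadable_address(address: str) -> bool:
    """Return True if the address has an http or https scheme and a host."""
    parts = urlparse(address.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ("https://example.com/shop", "example.com/shop", "ftp://example.com"):
    print(candidate, is_loadable_address(candidate))
```

Only the first address passes the check; the second lacks a scheme and the third uses a scheme other than `http` or `https`.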
In the Selectors panel you can add new selectors to a collection, modify them and navigate the selector list.
- Click the Add Selector button to start selecting elements on the web page. The clicked element is highlighted.
Dataflow Kit suggests similar elements you may also want to select and marks them on the page.
- Optionally, click a highlighted element to remove it from the selection. The removed element changes colour.
- Otherwise, click an unhighlighted element to add it to the current selection.
Repeat selection and rejection (steps 2 and 3) until the pattern matches the data you want to scrape.
This refines the CSS selectors further.
The number in a circle next to a selector shows how many elements are selected.
Press the Apply button to finish the selection, or click Cancel to start specifying patterns again.
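The element count shown next to a selector can be approximated offline. The sketch below counts elements by class name using Python's standard `html.parser`. It is a deliberately simplified stand-in for real CSS selector matching, and the sample markup and class names are invented for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_matches(html: str, class_name: str) -> int:
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical page fragment: two products and one promotional banner.
page = """
<div class="product"><span class="title">Lamp</span><span class="price">$10</span></div>
<div class="product"><span class="title">Desk</span><span class="price">$90</span></div>
<div class="banner"><span class="price old">$120</span></div>
"""

print(count_matches(page, "title"))  # 2
print(count_matches(page, "price"))  # 3, the banner price matches as well
```

In this sample the `price` class also matches the banner, which is the kind of unwanted element that rejecting a highlighted item is meant to exclude from the pattern.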
Once you have selected all the data you want for your first selector, repeat the steps above to add more selectors to the collection.
Clicking a selector highlights its corresponding elements on the loaded web page.
Find more information about selector types and options in the Selectors documentation.
Websites that contain long lists of items frequently break these up into pages. Navigating through different pages on a website is an integral part of the web scraping process. A paginator is used to scrape multiple pages or to process pages with infinite scroll.
Scroll up or down the web page until you can see the button or link that navigates to the next page.
Click the Add Paginator button and choose one of the paginator types from the drop-down list:
- "Next" link paginator type is used on pages containing a link that points to the next page. The next page link is extracted from the document by reading the href attribute of the element matched by the given CSS selector.
- Infinite scroll paginator type is used on pages that load additional content while the user scrolls down.
Note: You don't have to specify a CSS selector for `Infinite scroll` paginators.
- "Load more" button paginator type is a slight variation of `Infinite scroll`. With infinite scroll, content loading is triggered by scrolling the page; here, the user has to click a `Load more` button instead.
Selector is the CSS selector of the element used by the `"Next" link` or `"Load more" button` paginator types.
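The "Next" link behaviour described in the list above, reading the href attribute of the matched element, can be sketched with Python's standard library. The lookup here is reduced to "first `<a>` tag with a given class", which is only a simplification of CSS selector matching; the class name `next` and the URLs are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> tag carrying the given class."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attributes = dict(attrs)
        if self.class_name in (attributes.get("class") or "").split():
            self.href = attributes.get("href")


def next_page_url(html: str, base_url: str, class_name: str = "next") -> Optional[str]:
    """Return the absolute URL of the next page, or None if there is no link."""
    finder = NextLinkFinder(class_name)
    finder.feed(html)
    # Relative links are resolved against the URL of the current page.
    return urljoin(base_url, finder.href) if finder.href else None


html = '<a class="prev" href="/shop?page=1">Prev</a><a class="next" href="/shop?page=3">Next</a>'
print(next_page_url(html, "https://example.com/shop?page=2"))
```

When the page contains no matching link, the function returns `None`, which a scraper can treat as the last page.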
The scraper is now configured to go to the next page (and all remaining pages) after collecting all data from the current page.
If there is no paginator specified, then it is assumed that the initial URL is the only page to scrape.
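The overall pagination flow (collect the current page, then move on until no next page exists) can be sketched as a simple loop. This is an illustrative model only: the `fetch` callable, the in-memory `SITE` data and the `max_pages` safety limit are assumptions and do not describe how Dataflow Kit is implemented.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A fetcher takes a URL and returns (items on that page, next page URL or None).
Fetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def scrape_all_pages(start_url: str, fetch: Fetcher, max_pages: int = 100) -> List[str]:
    """Collect items page by page until there is no next page."""
    items: List[str] = []
    seen = set()  # visited URLs, guards against pagination loops
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = fetch(url)
        items.extend(page_items)
    return items


# In-memory stand-in for a paginated website (hypothetical data).
SITE: Dict[str, Tuple[List[str], Optional[str]]] = {
    "https://example.com/shop?page=1": (["Lamp", "Desk"], "https://example.com/shop?page=2"),
    "https://example.com/shop?page=2": (["Chair"], None),
}

print(scrape_all_pages("https://example.com/shop?page=1", SITE.__getitem__))
# ['Lamp', 'Desk', 'Chair']
```

A fetcher that always returns `None` as the next URL corresponds to the case with no paginator: only the initial URL is scraped.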
A selector of the Link type can point to a product details page, so we can click it to navigate to the details page and gather additional data.
Click the `Details` link. This loads the product details page, where you can collect additional data about each individual product.
You can repeat the steps described in the Select elements section for each additional piece of information you want to collect from the details page.
Preview. Extract data
After you have added selectors and set up the paginator and details, click Preview to get an idea of what data will be extracted. Once the process runs, you will see incoming data in the Data Viewer and be kept informed of progress.
Click Button to interrupt extraction at any time.
Choose the CSV, JSON, JSON Lines, Excel or XML output format and click the Launch button to start data extraction.
Right after data extraction completes, click the `Download` link to fetch the results.
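To help choose between the text-based output formats, the sketch below serializes the same records as JSON, JSON Lines and CSV using Python's standard library. The sample records and helper names are made up for illustration, and the exact layout of files produced by Dataflow Kit may differ.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


def to_json(records: List[Record]) -> str:
    """One JSON document holding an array of all records."""
    return json.dumps(records, indent=2)


def to_json_lines(records: List[Record]) -> str:
    """One JSON object per line, convenient for streaming large results."""
    return "\n".join(json.dumps(record) for record in records)


def to_csv(records: List[Record]) -> str:
    """Header row taken from the first record, then one row per record."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical extraction result.
records = [
    {"title": "Lamp", "price": "$10"},
    {"title": "Desk", "price": "$90"},
]

print(to_json(records))
print(to_json_lines(records))
print(to_csv(records))
```

JSON Lines and CSV can be processed row by row, while a single JSON document generally has to be parsed as a whole.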
The steps in this walkthrough may not be fully applicable to all websites from which you might desire to gather data. For help with specific data extraction tasks not found in this Getting Started guide, search the Help Center for relevant articles.