Selectors

Dataflow kit is a platform for web data extraction that requires no coding skills. In most cases it is enough to point at and select the needed elements on the loaded page to scrape data.

Dataflow kit uses CSS selectors to find HTML elements in web pages and to extract data from them. The DFK engine makes its best guess at the CSS selector for the selected elements, but you can also specify CSS selector values manually when needed. A list of links describing CSS selectors is at the bottom of the page.

Selector types

There are 3 possible selector types:
  •  Text selector extracts human-readable text from the selected element and from all its child elements. HTML tags are stripped and only text is returned.
  •  Link selector is used for link extraction and website navigation. It can capture the `href` attribute (URL) and the link text; alternatively, specify the special `Path` option to use the link for navigation only. When the `Path` option is specified, all other selectors become disabled and no results are returned from the current page.
  •  Image selector extracts src (URL) and alt attributes of an image.
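
As a rough illustration of what each selector type returns, here is a minimal Python sketch built on the standard library's `html.parser`. It is not DFK's implementation: the class name `SelectorSketch`, the function `extract` and the sample HTML fragment are all hypothetical, and the sketch processes a whole fragment instead of matching a CSS selector.

```python
from html.parser import HTMLParser


class SelectorSketch(HTMLParser):
    """Hypothetical helper: collects what each selector type would return."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # Text selector: text of the element and its children
        self.links = []       # Link selector: href attribute values
        self.images = []      # Image selector: (src, alt) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images.append((attrs.get("src"), attrs.get("alt")))

    def handle_data(self, data):
        # Tags never reach this method, so only human-readable text is kept.
        if data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    parser = SelectorSketch()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "images": parser.images,
    }


# Sample fragment invented for this example.
fragment = (
    '<div><h2>Blue <b>mug</b></h2>'
    '<a href="/item/1">Details</a>'
    '<img src="/mug.png" alt="A blue mug"></div>'
)
result = extract(fragment)
print(result)
```

In this sketch the text result includes the text of nested elements (`<b>`, `<a>`), which mirrors the Text selector's behavior of returning text from the selected element and all its children.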

Filters

Filters are used to manipulate text data when extracting.

The following filters are available:

  • Trim returns a copy of the extracted text or attribute value with all leading and trailing whitespace removed.
  • Normal case leaves the case and capitalization of the text or attribute value exactly as is.
  • UPPERCASE makes all of the letters in the text or attribute value uppercase.
  • lowercase makes all of the letters in the text or attribute value lowercase.
  • Capitalize capitalizes the first letter of each word in the text or attribute value.

Filters are available for the Text, Link and Image selector types. They apply to the Text selector's text, the Link selector's text and the Image selector's alt attribute.
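
The filters listed above can be approximated in a few lines of Python. This is only a sketch of the described behavior, not DFK's code: the function names are hypothetical, applying several filters in sequence is an assumption, and so is the choice to leave the remaining letters of each word untouched in `capitalize`.

```python
def trim(value):
    # Remove leading and trailing whitespace.
    return value.strip()


def normal_case(value):
    # Leave case and capitalization exactly as is.
    return value


def uppercase(value):
    return value.upper()


def lowercase(value):
    return value.lower()


def capitalize(value):
    # Uppercase the first letter of each space-separated word.
    # Assumption: the rest of each word is left unchanged.
    return " ".join(word[:1].upper() + word[1:] for word in value.split(" "))


def apply_filters(value, filters):
    # Assumption: filters are applied one after another, in the given order.
    for f in filters:
        value = f(value)
    return value


cleaned = apply_filters("  dataflow kit  ", [trim, capitalize])
print(cleaned)  # Dataflow Kit
```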


Regex

A regular expression can be used to extract a substring of the text that the selector returns.
The whole match (group 0) is returned as the result.
Some useful examples are listed in the table below.

RegExr is an online tool for learning, building and testing regular expressions.

| Text | Regex | Result |
| --- | --- | --- |
| price: 10.99$ | `[0-9]+\.[0-9]+` | 10.99 |
| id: H18JKDX4 | `[A-Z0-9]{8}` | H18JKDX4 |
| date: 2018-10-19 | `[0-9]{4}\-[0-9]{2}\-[0-9]{2}` | 2018-10-19 |
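
The examples in the table can be checked with a short Python sketch. It uses Python's `re` module, whose syntax may differ in details from the regex engine DFK uses; the helper name `apply_regex` is hypothetical, and returning `None` when nothing matches is an assumption.

```python
import re


def apply_regex(text, pattern):
    # Return the whole match (group 0) of the first occurrence, if any.
    match = re.search(pattern, text)
    return match.group(0) if match else None


examples = [
    ("price: 10.99$", r"[0-9]+\.[0-9]+"),
    ("id: H18JKDX4", r"[A-Z0-9]{8}"),
    ("date: 2018-10-19", r"[0-9]{4}\-[0-9]{2}\-[0-9]{2}"),
]

results = [apply_regex(text, pattern) for text, pattern in examples]
print(results)  # ['10.99', 'H18JKDX4', '2018-10-19']
```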

Rename Collection. Modify selectors. Change CSS selectors' values.

  1. Give the Collection and its selectors meaningful names in place of the names generated while you select elements on the page. Double-click an element's name to rename it.
    Note: In the output spreadsheet, the selector name becomes the header of the column containing the collected data.
  2. Specify the CSS selector value for web elements. Double-click the value to enter a new one manually.

Delete Selectors.

Delete any selector from a collection at any time by clicking the `Trash` button when you no longer need it.