Crawler browses the student portfolio websites acting as user and stores the scraped data to the database for the research purposes.
install dependencies
npm i
start crawl
npm run start
continue last crawl
CRAWLEE_PURGE_ON_START=0 npm start
  Crawler stores data to portfolio_pages table in our Supabase Postgres database.
| column | description | 
|---|---|
| url | crawled url used as unique key | 
| data | scraped data in jsonb | 
| created_at | timestamp of first creation | 
| portfolio_id | uuid of the portfolio | 
| updated_at | timestamp of last update (last crawl ended) | 
The scraped data are stored as json data structure, see its documentation.
Crawled data are stored as jsonb type in the postgres database. There is variety of build-in functions and operators to query the json or even index its parameters.
get all pages with successfully parsed post-content (the where clause excludes pages such as 404, search, archive, tag etc.)
select * from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';
get the plaintext of scraped posts
select data->'post-content'->'text' as text from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';
get the title, description and open graph image of scraped posts
select data->>'title' as title, data->>'description' as description, data->'og'->'image'->'imageValue' as og_image from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';
get the number of posts with and without parsed post-content
below are the stats from last crawl
| metric | count | 
|---|---|
| total_crawled | 3837 | 
| total_single | 1537 | 
| total_page | 472 | 
| single_without_tree | 0 | 
| single_with_tree | 1537 | 
| single_with_tree_published_at | 1442 | 
| page_without_tree | 16 | 
| page_with_tree | 456 | 
| with_with_tree_published_at | 368 | 
select
  count(*) as total_crawled,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
    ) as total_single,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
    ) as total_page,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') = 'null'
    ) as single_without_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as single_with_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'published-at') != 'null'
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as single_with_tree_published_at,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') = 'null'
    ) as page_without_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as page_with_tree,
    sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'published-at') != 'null'
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as with_with_tree_published_at
from
  portfolio_pages
Generated using TypeDoc