The workflow
In an ideal world scenario, it should work in the following way:

- Pull the Crawly Docker image from DockerHub.
- Create a simple configuration file.
- Start it!
- Create a spider via the YML interface.
The detailed documentation and an example can be found on HexDocs here:
https://hexdocs.pm/crawly/spiders_in_yml.html#content
This article follows all the steps from scratch, so it’s self-contained. But if you have any questions, please don’t hesitate to refer to the original docs or to ping us on the Discussions board.
The steps
1. First of all, we will pull Crawly from DockerHub:
docker pull oltarasenko/crawly:0.15.0
2. We will re-use the configuration from our previous article, since we need to extract the same data. So let’s create a file called `crawly.config` with the same content as before:
[{crawly, [
    {closespider_itemcount, 100},
    {closespider_timeout, 5},
    {concurrent_requests_per_domain, 15},
    {middlewares, [
        'Elixir.Crawly.Middlewares.DomainFilter',
        'Elixir.Crawly.Middlewares.UniqueRequest',
        'Elixir.Crawly.Middlewares.RobotsTxt',
        {'Elixir.Crawly.Middlewares.UserAgent', [
            {user_agents, [
                <<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
                <<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
            ]}
        ]}
    ]},
    {pipelines, [
        {'Elixir.Crawly.Pipelines.Validate', [{fields, [<<"title">>, <<"author">>, <<"publishing_date">>, <<"url">>, <<"article_body">>]}]},
        {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, <<"title">>}]},
        {'Elixir.Crawly.Pipelines.JSONEncoder'},
        {'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
    ]}
]}].
3. Start the container with the help of the following command:
docker run --name yml_spiders_example \
-it -p 4001:4001 \
-v $(pwd)/crawly.config:/app/config/crawly.config \
oltarasenko/crawly:0.15.0
Once done, you will see Crawly’s debug messages in your console. That is a good sign; it means it worked!
Now you can open `localhost:4001` in your browser, and your journey starts here!
4. Building a spider
Once you click Create New Spider, you will see a basic page allowing you to input your spider code:
One may say that these interfaces are super simple and basic for something from 2023. That’s right. We’re backend developers, and we do what we can, so this allows achieving the needed results with minimal frontend effort. If you have a passion for improving it, or want to contribute in any other way, you are more than welcome to do so!
Writing a spider
The interface above requires you to write valid YML, so you need to know what is expected. Let’s start with a basic example, add some explanations, and improve it later.
I suggest starting by inputting the following YML there:

name: ErlangSolutionsBlog
base_url: "https://www.erlang-solutions.com"
start_urls:
  - "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"
fields:
  - name: title
    selector: "title"
links_to_follow:
  - selector: "a"
    attribute: "href"
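To get a feel for what the `title` selector above does conceptually, here is a minimal Python sketch. Note this is only an illustration of the idea: Crawly applies CSS selectors with a real HTML parser, not a regex, and the sample HTML below is made up.

```python
import re

# Illustrative only: Crawly parses real HTML and applies CSS selectors;
# this regex sketch just shows the idea of extracting the <title> field.
html = '<html><head><title>Erlang Solutions</title></head><body></body></html>'

match = re.search(r'<title>(.*?)</title>', html)
title = match.group(1) if match else None
print(title)  # → Erlang Solutions
```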
Now if you click the Preview button, you will see what the spider is going to extract from your start URLs.
So, what you can see here is: the spider will extract only one field, called `title`, which equals “Erlang Solutions.” Besides that, your spider is going to follow these links after the start page:
"https://www.erlang-solutions.com/",
"https://www.erlang-solutions.com#",
"https://www.erlang-solutions.com/consultancy/consulting/",
"https://www.erlang-solutions.com/consultancy/development/",
....
The YML format
- name: A string representing the name of the scraper.
- base_url: A string representing the base URL of the website being scraped. The value must be a valid URI.
- start_urls: An array of strings representing the URLs to start scraping from. Each URL must be a valid URI.
- links_to_follow: An array of objects representing the links to follow when scraping a page. Each object must have the following properties:
  - selector: A string representing the CSS selector for the links to follow.
  - attribute: A string representing the attribute of the link element that contains the URL to follow.
- fields: An array of objects representing the fields to scrape from each page. Each object must have the following properties:
  - name: A string representing the name of the field.
  - selector: A string representing the CSS selector for the field to scrape.
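The rules above can be sketched as a small validator. This is a hedged Python illustration of the schema as described here, not Crawly’s actual validation code (which runs server-side and may differ in details such as error messages):

```python
from urllib.parse import urlparse

def valid_uri(value):
    # A URI is considered valid here if it has both a scheme and a host.
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)

def validate_spider(spec):
    """Check a spider definition (a parsed YML mapping) against the rules
    listed above. Returns a list of problems; an empty list means valid."""
    problems = []
    if not isinstance(spec.get("name"), str):
        problems.append("name must be a string")
    base_url = spec.get("base_url")
    if not (isinstance(base_url, str) and valid_uri(base_url)):
        problems.append("base_url must be a valid URI")
    urls = spec.get("start_urls", [])
    if not (isinstance(urls, list) and urls and all(valid_uri(u) for u in urls)):
        problems.append("start_urls must be a non-empty list of valid URIs")
    for link in spec.get("links_to_follow", []):
        if not ("selector" in link and "attribute" in link):
            problems.append("links_to_follow entries need selector and attribute")
    for field in spec.get("fields", []):
        if not ("name" in field and "selector" in field):
            problems.append("fields entries need name and selector")
    return problems

spider = {
    "name": "ErlangSolutionsBlog",
    "base_url": "https://www.erlang-solutions.com",
    "start_urls": ["https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"],
    "fields": [{"name": "title", "selector": "title"}],
    "links_to_follow": [{"selector": "a", "attribute": "href"}],
}
print(validate_spider(spider))  # → []
```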
Finishing
As in the original article, we plan to extract the following fields:

- title
- author
- publishing_date
- url
- article_body

The expected selectors are copied from the previous article and can be found using Google Chrome’s inspect & copy approach!
fields:
  - name: title
    selector: ".page-title-sm"
  - name: article_body
    selector: ".default-content"
  - name: author
    selector: ".post-info__author"
  - name: publishing_date
    selector: ".header-inner .post-info .post-info__item span"
By now, you have noticed that the `url` field is not added here. That’s because the URL is automatically added to every item by Crawly.
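Since the WriteToFile pipeline in our configuration writes `.jl` (JSON lines), each scraped item ends up as one JSON object per line. The sketch below shows the expected shape of one item; the field values are purely illustrative placeholders, not real scraped data, but note the `url` field that Crawly fills in automatically:

```python
import json

# Illustrative item shape; the values here are hypothetical placeholders,
# the real values come from the scraped page.
item = {
    "title": "Example article title",
    "author": "Example Author",
    "publishing_date": "1 Jan 2023",
    "article_body": "Example body text",
    "url": "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/",  # added by Crawly
}
line = json.dumps(item)           # one line in the .jl output file
print("url" in json.loads(line))  # → True
```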
Now if you hit Preview again, you should see the full scraped item.
Then, if you click Save, you will be able to run the full spider and see the actual results.
Conclusion
We hope you like our work, and that it helps reduce the need to write spider code, or maybe engages non-Elixir people who can play with it without coding!
Please don’t be too critical of the product: everything in this world has bugs, and we plan to improve it over time (as we have already done during the last 4+ years). If you have ideas and improvement suggestions, please drop us a message so we can help!
The post Effortlessly Extract Data from Websites with Crawly YML appeared first on Erlang Solutions.