1. The Goutte library
2. Usage
2.1 Fetching the HTML of the page to crawl
2.2 Filtering elements
2.3 Getting the text of an element
2.4 Getting an attribute value of an element
3. Application
Conclusion

Do you want to pull data from a website for your own purposes, such as building a web service or saving it as JSON or into a database? Then let's work through it together!

1. The Goutte library

Library documentation: https://goutte.readthedocs.io/en/latest/
Installation: use Composer, PHP's dependency manager, and run the command below

composer require fabpot/goutte

If you haven't installed Composer yet, see section 2 of my post "Bước đầu sử dụng Laravel" (Getting started with Laravel), where I give instructions and a download link.

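2. Usage

After running the command above, your project folder will look like the following image (Composer adds a vendor directory along with composer.json and composer.lock):

Create the file you will write your code in. As in the image above, I created a file named vidu.php. In that file, write the following before using the library.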

<?php 
require('vendor/autoload.php');

use Goutte\Client;

$client = new Client();

2.1 Fetching the HTML of the page to crawl

$crawler = $client->request('GET', 'https://freetuts.net/hoc-php');
/* Replace https://freetuts.net/hoc-php with the URL of the page you want to crawl */

2.2 Filtering elements

// Filter by tag name
$crawler->filter('h2')->each(function ($node) {
    print $node->text();
});

// Filter by tag with a class name
$crawler->filter('span.author')->each(function ($node) {
    print $node->text();
});

// Filter by tag with an id
$crawler->filter('span#author')->each(function ($node) {
    print $node->text();
});

// Filter by attribute value
$crawler->filter('[href="http://abc.com"]')->each(function ($node) {
    print $node->text();
});
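
The strings passed to filter() are CSS selectors, which DomCrawler translates to XPath through the symfony/css-selector package. To try out selectors without hitting a live site, here is a minimal offline sketch that uses PHP's built-in DOM extension instead of Goutte. The HTML is made-up sample markup, and the XPath expressions are only rough counterparts of the four selectors above, not the exact output of the library.

```php
<?php
// Offline sketch: PHP's built-in DOM extension, not Goutte.
// The HTML below is made-up sample markup, not taken from any real page.
$html = <<<'HTML'
<html><body>
  <h2>Hello</h2>
  <span class="author">Alice</span>
  <span id="author">Bob</span>
  <a href="http://abc.com">Link</a>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Rough XPath counterparts of the CSS selectors used with filter()
$queries = [
    'h2'                      => '//h2',
    'span.author'             => "//span[contains(concat(' ', normalize-space(@class), ' '), ' author ')]",
    'span#author'             => "//span[@id='author']",
    '[href="http://abc.com"]' => "//*[@href='http://abc.com']",
];

$results = [];
foreach ($queries as $css => $query) {
    $results[$css] = [];
    foreach ($xpath->query($query) as $node) {
        // textContent plays the role of $node->text() in Goutte
        $results[$css][] = trim($node->textContent);
    }
    echo $css, ' => ', implode(', ', $results[$css]), PHP_EOL;
}
```

Each selector matches exactly one element in the sample markup, which makes it easy to see which rule picks up which node before pointing the same selectors at a real page.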

2.3 Getting the text of an element

// Use $node->text() to get the text
$crawler->filter('span.author')->each(function ($node) {
    print $node->text();
});

/*
    each() walks through every node matched by the filter
    It works much like foreach
    But it is a method defined by the library
*/

2.4 Getting an attribute value of an element

// Use $node->attr('attributeName')
// href lives on <a> elements, so this filters links; a span normally has no href
$crawler->filter('a')->each(function ($node) {
    print $node->attr('href');
});

Many more methods are defined in vendor/symfony/dom-crawler/Crawler.php, so have a look there as well!

3. Application

Example: download all images on the page https://freetuts.net/hoc-php into the hinh folder

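<?php
require('vendor/autoload.php');

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'https://freetuts.net/hoc-php');

// Create the target folder if it does not exist yet
if (!is_dir('hinh')) {
    mkdir('hinh', 0777, true);
}

$crawler->filter('img')->each(function ($node) {
    $s = $node->attr('src');
    // file_get_contents() cannot fetch a relative src, so skip anything that is not a full URL
    if (!$s || !preg_match('#^https?://#i', $s)) {
        return;
    }
    // Use only the path part for the file name, so a query string does not end up in it
    $img = 'hinh/' . basename((string) parse_url($s, PHP_URL_PATH));
    file_put_contents($img, file_get_contents($s));
});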

/*
    The two lines that build $img and call file_put_contents():
    * save each image into the hinh folder
    * name it after the file name at the end of the URL in $s
*/
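
One thing to watch for in this example: file_get_contents() needs a full URL, while src attributes on real pages are often relative. The sketch below shows one way to turn a src into an absolute URL before downloading. absoluteUrl() is a hypothetical helper written for this post, not part of Goutte, and the image paths in the example calls are made up. It covers only the common cases and does not resolve "../" segments.

```php
<?php
/**
 * Hypothetical helper (not part of Goutte): turn an <img> src into an absolute URL.
 * Handles only the common cases; "../" segments are left unresolved.
 */
function absoluteUrl(string $pageUrl, string $src): string
{
    // Already absolute: return unchanged
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'https';
    $host   = $parts['host'] ?? '';

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    $origin = $scheme . '://' . $host . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Root-relative, e.g. /upload/logo.png
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }

    // Path-relative, e.g. images/a.png, resolved against the page's directory
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $src;
}

// Made-up image paths, for illustration only
echo absoluteUrl('https://freetuts.net/hoc-php', '/upload/logo.png'), PHP_EOL;
echo absoluteUrl('https://freetuts.net/hoc-php', 'images/a.png'), PHP_EOL;
```

Inside the each() callback you could then pass absoluteUrl($pageUrl, $node->attr('src')) to file_get_contents(), where $pageUrl is the address you gave to $client->request().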

Another example: get the text "Đăng bởi: Administrator" (Posted by: Administrator) from the page https://freetuts.net/hoc-php

<?php 
require('vendor/autoload.php');

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'https://freetuts.net/hoc-php');

$crawler->filter('span.author')->each(function ($node) {
    print $node->text();
});
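
Since the goal mentioned at the start was to keep the crawled data as JSON, here is a minimal sketch of that last step. To stay runnable without network access it uses a stand-in array; in a real script the array would come from the crawler, because each() returns an array holding whatever the callback returns. The file name authors.json and the "authors" key are arbitrary choices for this sketch.

```php
<?php
// Stand-in data: in the real script this array would come from the crawler, e.g.
// $authors = $crawler->filter('span.author')->each(function ($node) { return $node->text(); });
$authors = ['Đăng bởi: Administrator'];

// JSON_UNESCAPED_UNICODE keeps Vietnamese characters readable instead of \uXXXX escapes
$json = json_encode(['authors' => $authors], JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);

// File name is an arbitrary choice for this sketch
$file = 'authors.json';
file_put_contents($file, $json);

// Read the file back to confirm the round trip
$loaded = json_decode(file_get_contents($file), true);
print_r($loaded['authors']);
```

Returning the text from the callback instead of printing it is the only change needed in the crawling code; everything after that is plain PHP.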

Video demo of the two examples above