Tips Building Web Scrapers with Puppeteer + Node.js

ductriho · Sep 29, 2023

## Building Web Scrapers with Puppeteer + Node.js

Web scraping is the process of extracting data from websites. It can be used for a variety of purposes, such as data analysis, price monitoring, and market research.

Puppeteer is a Node.js library that allows you to control headless Chrome or Chromium browsers. This makes it a powerful tool for web scraping, as you can use it to automate the process of extracting data from websites.

In this tutorial, we will show you how to build a web scraper with Puppeteer and Node.js. We will scrape data from the [TechCrunch](https://techcrunch.com/) website, and we will save the data to a CSV file.

### Prerequisites

To follow this tutorial, you will need the following:

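* Node.js v14 or higher (recent Puppeteer releases require a newer Node.js version, so check the Puppeteer documentation for the current minimum)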
* The Puppeteer library
* A text editor or IDE

### Installing Puppeteer

You can install Puppeteer using the following command:

```
npm install puppeteer
```

### Creating a Web Scraper

To create a web scraper, you will need a new Node.js project. Create a directory for it (for example, `my-project`) and run the following command inside it:

```
npm init
```

This prompts for a few project details and writes a `package.json` file to the current directory. Inside the same directory, create a new file called `index.js`.

In the `index.js` file, you will need to import the Puppeteer library. You can do this by adding the following line to the top of the file:

```
const puppeteer = require('puppeteer');
```

You will also need to create a function that will start a headless Chrome browser and navigate to the TechCrunch website. You can do this by adding the following code to the `index.js` file:

```
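async function scrape() {
  // Create a new browser instance
  const browser = await puppeteer.launch({
    headless: true,
  });

  // Navigate to the TechCrunch website
  const page = await browser.newPage();
  await page.goto('https://techcrunch.com/');

  // Extraction code from the next section goes here

  // Close the browser so the Node.js process can exit
  await browser.close();
}

// Run the scraper and report any failure
scrape().catch(console.error);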
```

The `scrape()` function starts a headless Chrome browser and navigates to the TechCrunch website. Headless means the browser runs without a visible window. This suits web scraping, since the script can load and read pages on a machine with no display, such as a server.
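As a rough sketch of the launch-and-navigate step, the browser can be closed in a `finally` block so that a failed navigation or extraction does not leave a Chrome process running. The helper name `withPage` and its parameters are assumptions made for illustration and are not part of Puppeteer. The Puppeteer module is passed in as an argument, so the helper has no hard dependency of its own.

```javascript
// Hypothetical helper: open `url`, run `task(page)`, and always close the browser.
// `puppeteerModule` is whatever require('puppeteer') returns.
async function withPage(puppeteerModule, url, task) {
  const browser = await puppeteerModule.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Whatever the task returns becomes the result of withPage
    return await task(page);
  } finally {
    // Runs on success and on failure
    await browser.close();
  }
}

// Usage (needs `npm install puppeteer` and network access):
// const puppeteer = require('puppeteer');
// withPage(puppeteer, 'https://techcrunch.com/', (page) => page.title())
//   .then(console.log)
//   .catch(console.error);
```

Passing the work in as a callback keeps the cleanup in one place, whatever the task does with the page.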

### Scraping Data from the TechCrunch Website

Now that the scraper can load the page, you can extract data from it with the `page.evaluate()` method, which runs a JavaScript function inside the page and returns its result to your Node.js script.

Inside that function you have access to the page's DOM, so you can select elements with standard APIs such as `document.querySelector()` rather than parsing the raw HTML with regular expressions. The value you return must be serializable, such as a string, a number, or a plain object.

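For example, the following code gets the title of the first article, assuming its heading matches the selector `.article-title` (inspect the live page to confirm the actual class name):

```
// Yields null instead of throwing when no element matches the selector
const title = await page.evaluate(() => document.querySelector('.article-title')?.textContent.trim() ?? null);
```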

The promise returned by `page.evaluate()` resolves to the title of the first matching article as a string, which is stored in the `title` variable.
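One detail worth illustrating: the function passed to `page.evaluate()` runs inside the browser, so it cannot see variables from the Node.js script. Values such as a selector have to be passed as extra arguments. The sketch below wraps this in a hypothetical helper named `getText` (not part of Puppeteer), which also returns `null` instead of throwing when nothing matches.

```javascript
// Hypothetical helper: trimmed text of the first element matching `selector`,
// or null when nothing matches. `page` is a Puppeteer Page object.
async function getText(page, selector) {
  // `selector` is passed as the second argument and arrives in the browser as `sel`
  return page.evaluate((sel) => {
    const element = document.querySelector(sel);
    return element ? element.textContent.trim() : null;
  }, selector);
}

// Usage inside scrape(), with an assumed selector:
// const title = await getText(page, '.article-title');
```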

You can use the same method to extract other data from the TechCrunch website, such as the article's author, publication date, and content.
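A possible way to collect several fields per article in one round trip is Puppeteer's `page.$$eval()`, which runs a function over every element that matches a selector. This is only a sketch: the card selector `article` and the field selectors `.article-author` and `time` are placeholders, not TechCrunch's confirmed markup, and the function names are made up for illustration.

```javascript
// Placeholder selectors: adjust them after inspecting the real page.
const SELECTORS = { title: '.article-title', author: '.article-author', date: 'time' };

// Runs in the browser via page.$$eval, so it must not use outer variables;
// everything it needs is passed in as arguments.
function extractArticles(cards, selectors) {
  return cards.map((card) => {
    const record = {};
    for (const [field, selector] of Object.entries(selectors)) {
      const element = card.querySelector(selector);
      // Missing fields become null rather than causing an error
      record[field] = element ? element.textContent.trim() : null;
    }
    return record;
  });
}

// Returns an array of plain objects, one per matching card element.
async function scrapeArticles(page, cardSelector = 'article') {
  return page.$$eval(cardSelector, extractArticles, SELECTORS);
}
```

Each record is a plain object, so the result can be returned from the browser to Node.js unchanged.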

### Saving the Data to a CSV File

Once you have extracted the data from the TechCrunch website, you can save it to a CSV file with the `fs.writeFile()` method from Node's built-in `fs` module. It writes a string or buffer to a file, replacing the file if it already exists.

The following code will save the data to a CSV file called `techcrunch.csv`:

```
const fs = require('fs');

// `data` must already be a string of CSV text: a header line, then one line per article
fs.writeFile('techcrunch.csv', data, (err) => {
  if (err) {
    console.error(err);
  }
});
```

The `fs.writeFile()` method
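The snippet above expects `data` to be CSV text already. As a minimal sketch of how an array of scraped records might be turned into that text, the helpers below quote fields following the common CSV convention (wrap a field in double quotes when it contains a comma, quote, or line break, and double any embedded quotes). The names `csvField`, `toCsv`, and `saveCsv` are hypothetical, and the record shape is an assumption.

```javascript
const fs = require('fs');

// Quote one value for CSV; null and undefined become an empty field.
function csvField(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text: a header line, then one line per record, in `columns` order.
function toCsv(rows, columns) {
  const header = columns.map(csvField).join(',');
  const lines = rows.map((row) => columns.map((col) => csvField(row[col])).join(','));
  return [header, ...lines].join('\n') + '\n';
}

// Write the records to `path` using the promise-based fs API.
async function saveCsv(path, rows, columns) {
  await fs.promises.writeFile(path, toCsv(rows, columns), 'utf8');
}

// Usage, with made-up records:
// saveCsv('techcrunch.csv', [{ title: 'Example', author: 'Someone' }], ['title', 'author'])
//   .catch(console.error);
```

Keeping the text generation in `toCsv` separate from the file write makes the quoting logic easy to check without touching the disk.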
