got-scraping

apify · 245.5k · Apache-2.0 · 4.1.1

HTTP client made for scraping based on got.

readme

Got Scraping

Got Scraping is a small but powerful got extension for sending browser-like requests out of the box. This is essential in web scraping, where requests need to blend in with regular website traffic.

Installation

$ npm install got-scraping

The module is now ESM only

This means you have to load it with an import statement or the dynamic import() expression. You can do so either by migrating your project to ESM or by importing got-scraping in an async context.

-const { gotScraping } = require('got-scraping');
+import { gotScraping } from 'got-scraping';

If you cannot migrate to ESM, here's an example of how to import it in an async context:

let gotScraping;

async function fetchWithGotScraping(url) {
    gotScraping ??= (await import('got-scraping')).gotScraping;

    return gotScraping.get(url);
}

Note:

  • Node.js >=16 is required due to instability of HTTP/2 support in lower versions.

API

The Got Scraping package is built using got.extend(...), so it supports all the features Got has.

Interested in what's under the hood?

import { gotScraping } from 'got-scraping';

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body));

options

proxyUrl

Type: string

URL of the HTTP or HTTPS based proxy. HTTP/2 proxies are supported as well.

import { gotScraping } from 'got-scraping';

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://username:password@myproxy.com:1234',
    })
    .then(({ body }) => console.log(body));
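
Credentials that contain reserved characters such as @ or : need to be percent-encoded before they are embedded in proxyUrl. The following is a minimal sketch, assuming a hypothetical helper named buildProxyUrl (it is not part of got-scraping) that relies on the standard URL class to do the encoding:

```javascript
// Hypothetical helper, not part of got-scraping: assembles a proxy URL and
// lets the standard URL class percent-encode the credentials.
function buildProxyUrl({ protocol = 'http', host, port, username, password }) {
    const url = new URL(`${protocol}://${host}:${port}`);
    if (username !== undefined) url.username = username;
    if (password !== undefined) url.password = password;
    // URL adds a trailing slash for an empty path; strip it for readability.
    return url.href.replace(/\/$/, '');
}

const proxyUrl = buildProxyUrl({
    host: 'myproxy.com',
    port: 1234,
    username: 'user name',
    password: 'p@ss:word',
});

console.log(proxyUrl);
```

The resulting string can then be passed as the proxyUrl option in the same way as the literal URL in the example above.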

useHeaderGenerator

Type: boolean
Default: true

Whether to generate browser-like headers.

headerGeneratorOptions

See the HeaderGeneratorOptions docs.

import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89
            }
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    }
});

sessionToken

A non-primitive unique object which describes the current session. By default, it's undefined, so new headers will be generated every time. Headers generated with the same sessionToken never change.
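
To illustrate why the token has to be a non-primitive object, here is a rough sketch of the idea using a WeakMap keyed by the token. It is an illustration under assumptions, not got-scraping's internals: headersFor and generateHeaders are hypothetical names, and the generated header values are placeholders.

```javascript
// Illustration only, NOT got-scraping's internals. A WeakMap accepts only
// objects as keys, which is one reason a session token would be non-primitive.
const headersBySession = new WeakMap();
let generatedCount = 0;

// Hypothetical stand-in for a real header generator.
function generateHeaders() {
    generatedCount += 1;
    return { 'user-agent': `example-agent/${generatedCount}` };
}

function headersFor(sessionToken) {
    // No token: generate fresh headers on every call.
    if (sessionToken === undefined) return generateHeaders();

    // Same token object: reuse the headers generated for it the first time.
    if (!headersBySession.has(sessionToken)) {
        headersBySession.set(sessionToken, generateHeaders());
    }
    return headersBySession.get(sessionToken);
}

const sessionToken = {};
const first = headersFor(sessionToken);
const second = headersFor(sessionToken);
const withoutToken = headersFor();

console.log(first === second);       // same token, same headers object
console.log(first === withoutToken); // no token, freshly generated headers
```

With the real package, the equivalent is to create one object per session and pass that same object as the sessionToken option on every request in the session.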

Under the hood

Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.

Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it does another ALPN negotiation for the end server. Once that is complete, Got Scraping can proceed with HTTP requests.

Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping makes use of it too. Fortunately, this is already supported by Got, which automatically handles ALPN protocol negotiation to select the best available protocol.

HTTP/1.1 headers are always automatically formatted in Pascal-Case. However, there is an exception: x- headers are not modified in any way.
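
The rule above can be sketched as a small function. This is a minimal illustration of the described behavior, not the package's actual implementation; toPascalCaseHeaders is a hypothetical name.

```javascript
// Sketch of the described rule, not got-scraping's actual implementation:
// capitalize each dash-separated word, but leave x- headers untouched.
function toPascalCaseHeaders(headers) {
    const result = {};
    for (const [name, value] of Object.entries(headers)) {
        const formatted = name.toLowerCase().startsWith('x-')
            ? name
            : name
                .split('-')
                .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
                .join('-');
        result[formatted] = value;
    }
    return result;
}

const formattedHeaders = toPascalCaseHeaders({
    'user-agent': 'test',
    'accept-language': 'en-US',
    'x-custom-header': '1',
});

console.log(formattedHeaders);
```

Here user-agent and accept-language become User-Agent and Accept-Language, while x-custom-header keeps its original spelling.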

By default, Got Scraping uses an insecure HTTP parser, which allows access to websites served by non-spec-compliant web servers.

Last but not least, Got Scraping comes with an updated TLS configuration. Some websites fingerprint it and compare it with real browsers. While Node.js doesn't support OpenSSL 3 yet, the current configuration should still work flawlessly.

To get more detailed information about the implementation, please refer to the source code.

Tips

This package can only generate the standard headers. You might want to add the referer header if necessary. Bear in mind that the generated headers are meant for GET requests for HTML documents. If you want to make POST requests, or GET requests for any other content type, you should alter the headers to suit your needs. You can do so by passing a headers option or writing a custom Got handler.

This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require additional care.

Overriding request headers

import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});

For more advanced usage please refer to the Got documentation.

JSON mode

You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for HTML content type. You might want to alter the generated headers to match the browser ones.

import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});
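
As a rough sketch of altering the generated headers for a JSON request, the options object below combines responseType with the headers option shown under "Overriding request headers". The accept value is an illustrative assumption, not a value taken from the package; check what a real browser sends to the target endpoint and adjust accordingly.

```javascript
// Options for a JSON API call. The accept value is an assumption chosen for
// illustration; match it to what a real browser sends for the target endpoint.
const jsonOptions = {
    url: 'https://api.apify.com/v2/browser-info',
    responseType: 'json',
    headers: {
        accept: 'application/json',
    },
};

// With got-scraping installed, the object is passed straight to the client:
//   const { body } = await gotScraping(jsonOptions);
console.log(jsonOptions);
```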

Error recovery

This section covers possible errors that might happen due to different site implementations.

RequestError: Client network socket disconnected before secure TLS connection was established

The error above can be a result of the server not supporting the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.

changelog

4.0.7 / 2024/10/23

  • Handles proxy authentication consistently throughout the codebase (solves e.g. this http2-wrapper issue).

4.0.6 / 2024/05/22

  • Logging CONNECT error response body instead of the length only

4.0.5 / 2024/04/03

  • Fixed processing http:// requests over https:// proxies correctly

4.0.4 / 2024/02/16

  • Fixed passing the timeout to the resolveProtocol calls

4.0.3 / 2023/12/11

  • Fixed missing extended types for gotScraping.stream and gotScraping.paginate
  • Fixed general type issues with got-scraping, including not reporting incorrect types for known properties like proxyUrl

4.0.2 / 2023/11/29

  • Fixed runtime exceptions when using got-scraping in a project with older versions of Node.js 16

4.0.1 / 2023/11/16

  • Fix compilation errors when this module is used in TypeScript with a project that isn't using Node16/NodeNext module/moduleResolution

4.0.0 / 2023/11/07

  • BREAKING: This module is now ESM only.
    • You will need to either migrate your projects to ESM, or import got-scraping in an async context via await import('got-scraping');
  • Update got to v13

3.1.0 / 2021/08/23

  • Add sessionToken option to persist generated headers

3.0.1 / 2021/08/20

  • Use own proxy agent

3.0.0 / 2021/08/19

  • Switch to TypeScript
  • Enable insecure parser by default
  • Use header-generator to order headers
  • Remove default export in favor of import { gotScraping }
  • Fix leaking ALPN negotiation

2.1.2 / 2021/08/06

  • Mimic got interface

2.1.1 / 2021/08/06

  • Use header-generator v1.0.0

2.1.0 / 2021/08/06

  • Add TransformHeadersAgent
  • Optimizations
  • Use Got 12
  • docs: fix instances anchor

2.0.2-beta / 2021/08/04

  • Use TransformHeadersAgent internally to transform headers to Pascal-Case

2.0.1 / 2021/07/22

  • Pin http2-wrapper version as the latest was causing random CI failures

2.0.0 / 2021/07/22

  • BREAKING: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy.
  • Fix various issues by refactoring from got handlers to hooks.

1.0.4 / 2021/05/17

  • HTTP2 protocol resolving fix

1.0.3 / 2021/04/27

  • HTTP2 wrapper fix

1.0.2 / 2021/04/18

  • Fixed TLS

1.0.1 / 2021/04/15

  • Improved ciphers
  • Fixed request payload sending

1.0.0 / 2021/04/07

  • Initial release