@@ -4,22 +4,72 @@ Crawl a site to generate knowledge files to create your own custom GPT from one

-
## Example

-[Here is a custom GPT](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) that I quickly made to help answer questions about how to use and integrate [Builder.io](https://www.builder.io) by simply providing the URL to the Builder docs.
+[Here is a custom GPT](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) that I quickly made to help answer questions about how to use and integrate [Builder.io](https://www.builder.io) by simply providing the URL to the Builder docs.

This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.

-[Try it out yourself](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) by asking questions about how to integrate Builder.io into a site.
+[Try it out yourself](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) by asking questions about how to integrate Builder.io into a site.

> Note that you may need a paid ChatGPT plan to access this feature

## Get started

+### Install
+
+```sh
+npm i -g @builder.io/gpt-crawler
+```
+
+### Run
+
+```sh
+gpt-crawler --url https://www.builder.io/c/docs/developers --match https://www.builder.io/c/docs/** --selector .docs-builder-container --maxPagesToCrawl 50 --outputFileName output.json
+```
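If the crawler is launched from a script, the flags in the command above can be assembled from a plain object. The following is a minimal TypeScript sketch and not part of the package: `CrawlOptions`, `toCliArgs`, and `docsCrawl` are hypothetical names, and the only flags assumed to exist are the five shown in the command above.

```typescript
// Hypothetical helper: mirrors only the flags shown in the README command.
type CrawlOptions = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
};

// Build the argument list in the same order as the README command.
function toCliArgs(options: CrawlOptions): string[] {
  return [
    "--url", options.url,
    "--match", options.match,
    "--selector", options.selector,
    "--maxPagesToCrawl", String(options.maxPagesToCrawl),
    "--outputFileName", options.outputFileName,
  ];
}

// Values copied from the README command.
const docsCrawl: CrawlOptions = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

// Prints the same command line as the README example.
console.log(["gpt-crawler"].concat(toCliArgs(docsCrawl)).join(" "));
```

Passing the array to a process-spawning API, rather than joining it into a shell string, avoids the shell expanding the `**` in `--match`.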
+
+### Upload your data to OpenAI
+
+The crawl will generate a file called `output.json` at the root of this project. Upload that [to OpenAI](https://platform.openai.com/docs/assistants/overview) to create your custom assistant or custom GPT.
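Before uploading, it can help to confirm that the file parses and is not empty. The sketch below assumes `output.json` holds a JSON array with one entry per crawled page; the README does not spell out the file's shape, so treat that as an assumption. `countEntries` is a hypothetical helper, and the fields in `sample` are illustrative only.

```typescript
// Hypothetical check; assumes the crawl output is a JSON array.
function countEntries(jsonText: string): number {
  const parsed: unknown = JSON.parse(jsonText); // throws on malformed JSON
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of crawled pages");
  }
  return parsed.length;
}

// Stand-in for the contents of output.json; the field names are made up.
const sample = JSON.stringify([
  { url: "https://www.builder.io/c/docs/developers", text: "..." },
]);

console.log(countEntries(sample)); // 1
```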
+
+#### Create a custom GPT
+
+Use this option for UI access to your generated knowledge that you can easily share with others
+
+> Note: you may need a paid ChatGPT plan to create and use custom GPTs right now
+
+1. Go to [https://chat.openai.com/](https://chat.openai.com/)
+2. Click your name in the bottom left corner
+3. Choose "My GPTs" in the menu
+4. Choose "Create a GPT"
+5. Choose "Configure"
+6. Under "Knowledge" choose "Upload a file" and upload the file you generated
+
+
+
+#### Create a custom assistant
+
+Use this option for API access to your generated knowledge that you can integrate into your product.
+
+1. Go to [https://platform.openai.com/assistants](https://platform.openai.com/assistants)
+2. Click "+ Create"
+3. Choose "upload" and upload the file you generated
+
+
+
+## (Alternate method) Running in a container with Docker
+
+To obtain `output.json` from a containerized run, go into the `containerapp` directory and modify `config.ts` the same way as above. The `output.json` file should be generated in the data folder. Note: the `outputFileName` property in the `config.ts` file in the `containerapp` folder is configured to work with the container.
+
+## Contributing
+
+Know how to make this project better? Send a PR!
+
+## Get started developing
+
### Prerequisites

-Be sure you have Node.js >= 16 installed
+Be sure you have Node.js >= 16 installed along with [bun](https://bun.sh/)

### Clone the repo

@@ -30,17 +80,12 @@ git clone https://github.com/builderio/gpt-crawler
### Install Dependencies

```sh
-npm i
-```
-
-If you do not have Playwright installed:
-```sh
-npx playwright install
+bun i
```

-### Configure the crawler
+### Running GPT Crawler with a hardcoded configuration file

-Open [config.ts](config.ts) and edit the `url` and `selectors` properties to match your needs.
+Open [hardcoded.ts](./src/hardcoded.ts) and edit the `url`, `match` and `selectors` properties to match your needs.

E.g. to crawl the Builder.io docs to make our custom GPT you can use:

@@ -54,7 +99,7 @@ export const config: Config = {
};
```

-See the top of the file for the type definition for what you can configure:
+See the top of the [config.ts](./config.ts) file for the type definition for what you can configure:

```ts
type Config = {
@@ -68,62 +113,22 @@ type Config = {
  maxPagesToCrawl: number;
  /** File name for the finished data */
  outputFileName: string;
-  /** Optional cookie to be set. E.g. for Cookie Consent */
-  cookie?: {name: string; value: string}
  /** Optional function to run for each page found */
  onVisitPage?: (options: {
    page: Page;
    pushData: (data: any) => Promise<void>;
  }) => Promise<void>;
-  /** Optional timeout for waiting for a selector to appear */
-  waitForSelectorTimeout?: number;
+  /** Optional timeout for waiting for a selector to appear */
+  waitForSelectorTimeout?: number;
};
```
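The `onVisitPage` hook in the type above receives the current page and a `pushData` callback. A minimal sketch of such a hook follows, assuming `Page` is Playwright's `Page` (the install steps mention Playwright, but the import is not shown here). To stay self-contained, the example uses a small stand-in interface, `PageLike`, with only the `title()` and `url()` methods; `VisitOptions`, `recordTitle`, and the fake page are hypothetical names for illustration.

```typescript
// Stand-in for the two Playwright Page methods used here (an assumption).
interface PageLike {
  title(): Promise<string>;
  url(): string;
}

// Same shape as the onVisitPage options, with PageLike in place of Page.
type VisitOptions = {
  page: PageLike;
  pushData: (data: any) => Promise<void>;
};

// Hypothetical hook: stores the title and URL of every visited page.
async function recordTitle({ page, pushData }: VisitOptions): Promise<void> {
  const title = await page.title();
  await pushData({ title, url: page.url() });
}

// Exercise the hook with a fake page and an in-memory sink.
const collected: any[] = [];
const fakePage: PageLike = {
  title: async () => "Docs",
  url: () => "https://www.builder.io/c/docs/developers",
};
const pendingVisit = recordTitle({
  page: fakePage,
  pushData: async (data) => {
    collected.push(data);
  },
});
```

In a real config, a function like this would be assigned to `onVisitPage`, and the crawler would supply the actual page and `pushData`.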

-### Run your crawler
+#### Run your crawler

```sh
-npm start
+bun start
```

-### Upload your data to OpenAI
-
-The crawl will generate a file called `output.json` at the root of this project. Upload that [to OpenAI](https://platform.openai.com/docs/assistants/overview) to create your custom assistant or custom GPT.
-
-#### Create a custom GPT
-
-Use this option for UI access to your generated knowledge that you can easily share with others
-
-> Note: you may need a paid ChatGPT plan to create and use custom GPTs right now
-
-1. Go to [https://chat.openai.com/](https://chat.openai.com/)
-2. Click your name in the bottom left corner
-3. Choose "My GPTs" in the menu
-4. Choose "Create a GPT"
-5. Choose "Configure"
-6. Under "Knowledge" choose "Upload a file" and upload the file you generated
-
-
-
-
-#### Create a custom assistant
-
-Use this option for API access to your generated knowledge that you can integrate into your product.
-
-1. Go to [https://platform.openai.com/assistants](https://platform.openai.com/assistants)
-2. Click "+ Create"
-3. Choose "upload" and upload the file you generated
-
-
-
-## (Alternate method) Running in a container with Docker
-To obtain the `output.json` with a containerized execution. Go into the `containerapp` directory. Modify the `config.ts` same as above, the `output.json`file should be generated in the data folder. Note : the `outputFileName` property in the `config.ts` file in containerapp folder is configured to work with the container.
-
-
-## Contributing
-
-Know how to make this project better? Send a PR!
-
<br>
<br>