
ultimate-guitar.com Tab Scraper

This file set walks you through a 6-step process to scrape the tabs off ultimate-guitar.com.

It takes advantage of a "feature" of ultimate-guitar.com's rendering: each page embeds its data as HTML-escaped JSON inside a div with the class js-store.
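As a rough illustration, that embedded store can be pulled out with a few lines of Node. The `data-content` attribute name and the entity escaping below are assumptions about the page markup and may need adjusting against the live site:

```javascript
// Sketch: extract the JSON that ultimate-guitar.com embeds in its pages.
// Assumes the store lives in the data-content attribute of the .js-store
// div as HTML-escaped JSON -- verify against the live markup.
function extractStore(html) {
  const match = html.match(/class="js-store"\s+data-content="([^"]*)"/);
  if (!match) return null;
  // Undo the HTML-entity escaping applied to the embedded JSON
  // (&amp; is decoded last so it cannot create new entities)
  const json = match[1]
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
  return JSON.parse(json);
}
```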

Scraping all 1.1 million public tabs from the site is pretty easy and can be done in 6 steps.

You're going to want a VPN for this, because you will get kicked off and IP-blocked every 2-8 hours (depending on how aggressive you are when scraping). Reconnect through another IP and you'll be good to continue scraping.

Prerequisites:

  • sqlite3 command line client: https://sqlite.org/download.html
  • Node.js: https://nodejs.org/

1. Scrape Tab URLs

This step maps out all pages on ultimate-guitar.com that can be scraped

Enter 01-scraper-urls

Run

npm install

Open up 01-scrape-bands.js and customize the band list links

Run

node 01-scrape-bands.js

This script will save the artist data to output/artists/*.json

Open up 02-scrape-artist-tab-urls.js and customize the artist file list

Run

node 02-scrape-artist-tab-urls.js

This script will add tab information to the artist data and save it to output/artists-with-tabs/*.json

2. Ingest the scraped URLs into a sqlite database

This step converts the .json files into a sqlite database to allow scraping to be paused and restarted easily.
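Conceptually, the ingest flattens each artist file into one row per tab. A minimal sketch of that transformation, minus the actual sqlite inserts; the JSON field names here are hypothetical, so check them against your scraped files:

```javascript
// Sketch: flatten one artists-with-tabs JSON object into row objects for
// the `tabs` table. Field names are assumptions for illustration only.
function artistToRows(artist) {
  return (artist.tabs || []).map((tab) => ({
    artist_name: artist.name, // assumed field
    artist_url: artist.url,   // assumed field
    tab_url: tab.tab_url,
    type_name: tab.type_name,
    song_name: tab.song_name,
  }));
}
```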

Copy 01-scraper-urls/output/artists-with-tabs/*.json into 02-injest-sqlite/input/

Run

npm install
node 01-injest-sqlite.js

Note: this script queues up the artist inserts into the database and then waits for the inserts to finish. Don't be surprised if it hangs for a few hours (it took 6 hours on my SSD).

This script will create a sqlite database, output/tabs-no-text.db

3. (Optional) Split the sqlite database for parallelized scraping

This step splits the generated sqlite database into multiple databases so that you can more easily use multiple machines to scrape the site.

Copy the tabs-no-text.db into 03-splitter-sqlite/

Determine the number of machines that you want to run the scraper on. Call this number N

Open up tabs-no-text.db

sqlite3 tabs-no-text.db

Create a view that separates the tabs into buckets

CREATE VIEW tabs_bucketed AS SELECT *, NTILE(N) OVER (ORDER BY rowid) AS bucket FROM tabs;

For each machine, i:

  1. Create a new database

sqlite3 tabs-i-no-text.db

  2. Attach the base database

ATTACH 'tabs-no-text.db' AS db2;

  3. Create a table containing the rows from the machine's bucket

CREATE TABLE tabs AS SELECT * FROM db2.tabs_bucketed WHERE bucket=i;

Make sure to hold onto your tabs-no-text.db database for the merging process.

4. Scrape the tabs

For each machine, i

Copy the machine's tabs database into 04-scraper-tabs/input/tabs.db

Important: Either rename the machine's database file to tabs.db, or update the file name on line 100 of 01-scrape.js (filename: './input/tabs-laptop.db').

It's highly recommended to index your database

Enter 04-scraper-tabs/input/

Run

sqlite3 tabs.db

Run the following SQL queries:

CREATE INDEX IF NOT EXISTS tabs_scrape_id_idx ON tabs (scrape_id);
CREATE INDEX IF NOT EXISTS tabs_tab_url_idx ON tabs (tab_url);
CREATE INDEX IF NOT EXISTS tabs_type_name_idx ON tabs (type_name);

4.2 (Optional) Optimize the tab scraper

Customize 01-scrape-tabs.js

Key Lines:

  1. Line 192: let queue = new ConcurrentQueue(5);
    • Increasing this value increases the number of concurrent requests sent.
    • Note: Higher concurrent request counts result in more aggressive scrapes that may run more quickly but also get you kicked off more quickly.
  2. Line 214: await sleep(100);
    • Increasing this value (in ms) increases the delay between the first few concurrent requests. This staggers the requests, potentially reducing the chance you get kicked off.
  3. Line 183: LIMIT 300
    • Increasing this value increases the number of tabs pulled from the database at a time before sending a status update. Lower values query the database more often but give more frequent status updates; higher values take up more process memory and give less frequent status updates.
    • If this value is set too low, removed tab URLs can fill up the result set, causing the program to incorrectly detect that it got kicked off.

I found that scraping about 500 tabs/minute gave me a good balance between scraping speed and effort spent reconnecting to the VPN. Typically I had to reset the scraper every 4-6 hours at this rate.

I got this with concurrency=5 and sleep=100.
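For intuition, the throttling pattern those two knobs control looks roughly like this. This ConcurrentQueue is a hypothetical reimplementation for illustration, not the scraper's actual class:

```javascript
// Sketch of a concurrency limiter: at most `limit` tasks run at once,
// matching the role of `new ConcurrentQueue(5)` in the scraper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class ConcurrentQueue {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiting = []; // resolvers for tasks waiting on a free slot
  }
  async run(task) {
    if (this.active >= this.limit) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      const next = this.waiting.shift();
      if (next) next(); // hand the freed slot to one waiter
    }
  }
}
```

Staggering the first few requests then amounts to an `await sleep(100)` between the initial `queue.run(...)` calls.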

4.3 Add required columns for scraping

Open your tabs.db in sqlite

sqlite3 tabs.db

Run the following commands to add the needed columns:

ALTER TABLE tabs DROP COLUMN tab_text;
ALTER TABLE tabs ADD COLUMN user_id INTEGER;
ALTER TABLE tabs ADD COLUMN user_iq INTEGER;
ALTER TABLE tabs ADD COLUMN username TEXT;
ALTER TABLE tabs ADD COLUMN tab_text TEXT;

4.4 Scrape the tabs

Important: Either rename the machine's database file to tabs.db, or update the file name on line 100 of 01-scrape.js (filename: './input/tabs-laptop.db').

Run

npm install
node 01-scrape-tabs.js

This will add tab information to the tabs.db database.

Note: This scraper only works for the following tab types:

  • Bass
  • Chords
  • Drums
  • Tab
  • Ukulele

The following tab types are not supported:

  • Guitar Pro
  • Official
  • Power
  • Video

5. (Optional) Merge the sqlite databases

Move the partial tabs databases from each machine to 05-merger-sqlite/input/{tabs-i.db}

  • Note: i is the machine number from before

Move the tabs-no-text.db database from step 2 into 05-merger-sqlite/input/.

Open up tabs-no-text.db

sqlite3 tabs-no-text.db

Create an index on tabs.tab_url

CREATE INDEX tabs_tab_url_idx ON tabs (tab_url);

Open up the final tabs database, tabs-full.db

sqlite3 tabs-full.db

Attach the no-text database

ATTACH 'tabs-no-text.db' AS 'dbnt';

Attach each machine database in the following format:

ATTACH 'tabs-i.db' AS 'dbi';

Create the final tabs table:

CREATE TABLE tabs (
    scrape_id INTEGER,
    artist_scrape_id INTEGER NOT NULL,
    id INTEGER,
    song_id INTEGER,
    song_name TEXT,
    artist_id INTEGER,
    artist_name INTEGER,
    type TEXT,
    part TEXT,
    version INTEGER,
    votes INTEGER,
    rating NUMERIC,
    date TEXT,
    status TEXT,
    preset_id INTEGER,
    tab_access_type TEXT,
    tp_version INTEGER,
    tonality_name TEXT,
    version_description TEXT,
    verified INTEGER,
    artist_url TEXT,
    tab_url TEXT,
    difficulty TEXT,
    tuning TEXT,
    type_name TEXT,
    user_id INTEGER,
    user_iq INTEGER,
    username TEXT,
    tab_text TEXT
);

For each machine, insert its respective tabs into the table:

INSERT INTO tabs
SELECT
    tabsnt.scrape_id,
    tabsnt.artist_scrape_id,
    tabsnt.id,
    tabsnt.song_id,
    tabsnt.song_name,
    tabsnt.artist_id,
    tabsnt.artist_name,
    tabsnt.type,
    tabsnt.part,
    tabsnt.version,
    tabsnt.votes,
    tabsnt.rating,
    tabsnt.date,
    tabsnt.status,
    tabsnt.preset_id,
    tabsnt.tab_access_type,
    tabsnt.tp_version,
    tabsnt.tonality_name,
    tabsnt.version_description,
    tabsnt.verified,
    tabsnt.artist_url,
    tabsnt.tab_url,
    tabsnt.difficulty,
    tabsnt.tuning,
    tabsnt.type_name,
    tabsm.user_id,
    tabsm.user_iq,
    tabsm.username,
    tabsm.tab_text
FROM dbnt.tabs AS tabsnt
JOIN dbi.tabs AS tabsm ON tabsnt.tab_url = tabsm.tab_url
WHERE tabsm.tab_url IS NOT NULL
  AND tabsm.tab_text IS NOT NULL;

Note: this command can take a bit to complete (30s-2m) depending on how large your databases are.

6. Print the contents of the database into organized text files

Copy your filled tabs database to 06-output-generator/input/tabs-full.db

Note: Make sure you either rename it in the directory or update 01-output-generator.js:52 with the proper file name

Run

npm install
node --max-old-space-size=16384 01-output-generator.js

Note: depending on how many tabs you scraped, you may have to increase max-old-space-size (the maximum heap size). This example allows 16 GB of RAM.

  • I'm suspicious the memory leak is in the sqlite package >:|

Congratulations! Your guitar tabs are now organized in: 06-output-generator/output/{type}/{artist}-{artist_id}/{song}.txt

Other Information

You can customize the output generator's file output by modifying the fileText variable in 01-output-generator.js:84-99

Note: The .keep files can be ignored or deleted; they exist only to preserve the default directory structure in git.