
The 4archive, one of the few selective public archives of 4chan data, shut down on May 7th, 2015.

Importing a MySQL Dump

Sometimes you might only have access to a SQL database backup. To work with this database, you will have to create a local MySQL/MariaDB server, and import the database backup.

  1. Install MySQL/MariaDB. Some Linux distros may have already migrated to MariaDB, which is a drop-in replacement for MySQL.

  2. Start the MySQL/MariaDB service. On Debian/Ubuntu the service is typically named mysql rather than mysqld:

    sudo service mysql start
    
  3. Run the MySQL/MariaDB Secure Installation Script as root, and follow the instructions. Make sure to drop the test database and reload the privilege tables.

    sudo mysql_secure_installation
    
  4. Once you're finished, log in as the root MySQL user.

    mysql -u root -p
    
  5. Create a user to manage the databases. For example, I used user: localuser with password: password.

MariaDB [(none)]> CREATE USER 'localuser'@'localhost' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'localuser'@'localhost'
    ->            WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
  6. Now create a new database to import the backup into. In this example, we call it 4archive. Then quit.
MariaDB [(none)]> create database 4archive;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> quit

Now you can import the database. As the localuser user, we access the localhost SQL server and insert the 4archive_dump.sql dump into the 4archive database.

$ mysql -u localuser -p -h localhost 4archive < 4archive_dump.sql

Convert MySQL to SQLite

The first step was to convert the MySQL dump into a manageable offline database. I chose SQLite, since it is a portable and simple SQL database format. (Later on, I might look into CouchDB or MongoDB, due to the workings of our chan.zip format.)

I imported the MySQL dump into a local MariaDB server (via Linux, of course). Then, following this export guide, I used the Ruby gem sequel to convert to SQLite.

$ gem install sequel mysql sqlite3 
$ sequel mysql://<user>:<password>@<host>/<database> -C sqlite://db.sqlite

Example, for our 4archive database:

$ sequel mysql://localuser:password@localhost/4archive -C sqlite://4archive.sqlite

It saves to the SQLite file named in the output URL, 4archive.sqlite in this case. Nice and easy.

Convert from Latin1 to UTF-8 Encoding

Well, not yet. MySQL uses Latin1 encoding by default, which has long been superseded by UTF-8 encoding (used in SQLite). This usually isn't an issue for characters within the ASCII set, but accented Latin characters might get corrupted in the process. You should convert the MySQL database to use UTF-8 if possible, or otherwise take this possible bug into account.

This primarily affects the post.original_image_name entry. Since the post.body (containing the actual comment text) is stored as a BLOB, it is not affected.
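
If the conversion has already happened, one way to take this into account is a small repair pass over the affected column in the SQLite file. The sketch below assumes the usual failure mode (UTF-8 bytes that were read back as Latin1) and assumes posts has an id primary key; spot-check a few rows by hand before rewriting anything.

# Sketch: repair Latin1/UTF-8 round-trip damage in posts.original_image_name
# after the export to SQLite. Assumes posts.id exists; verify before running.
import sqlite3

def fix_mojibake(text):
    """Undo a Latin1 round-trip if the bytes are valid UTF-8, else keep as-is."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

conn = sqlite3.connect("4archive.sqlite")
rows = conn.execute(
    "SELECT id, original_image_name FROM posts "
    "WHERE original_image_name IS NOT NULL").fetchall()
for row_id, name in rows:
    fixed = fix_mojibake(name)
    if fixed != name:
        conn.execute("UPDATE posts SET original_image_name = ? WHERE id = ?",
                     (fixed, row_id))
conn.commit()
conn.close()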

Wgetting all the Imgur/Imageshack Images

The next step is to obtain all the Imgur/Imageshack images. 4archive was unique in that it hosted no images, instead depending on third-party services to do the hard work.

However, it's a good idea to keep these images archived in the Internet Archive, so we need to make a list of images to scrape. Another good idea is to sort them by chan.zip style folders.

JOIN Statement for Image Scraping

I constructed this SQL JOIN Statement to generate a report of all image urls and accompanying information. Made for SQLite, but should work with MySQL as well.

This statement:

  • Conducts an INNER JOIN between the posts and threads tables.
  • Displays the columns board, thread_id, post_id/chan_id, image_url, image_name, and original_image_name.
  • Only displays posts that contain images.
  • Orders by board, then thread_id, then post_id/chan_id, so everything is chronologically sequential.
SELECT threads.board,
	threads.thread_id,
	posts.chan_id AS post_id,
	posts.image_url,
	posts.chan_image_name AS image_name,
	posts.original_image_name
FROM posts
INNER JOIN threads ON threads.id = posts.threads_id
WHERE posts.image_url IS NOT NULL
ORDER BY threads.board,
	threads.thread_id,
	posts.chan_id;

While we could just make a list of all images in an unsorted fashion, a better method is to put the Imgur/Imageshack links into folders ordered by board/thread/image_urls.txt, as mandated by the chan.arc standard. That way, the images would be automatically sorted into folders when they are downloaded.

We created a Python script to do the hard work of creating the folders and dumping the image URLs to a text file (e.g. b/21451323/imageurls.txt). It takes only a few minutes to generate all of the folders and text files.
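
Here is a minimal sketch of that script, assuming the 4archive.sqlite file and the schema used by the JOIN statement above; the real script may differ in details.

# Sketch: create one folder per thread and append each image URL to that
# thread's imageurls.txt. Output layout: board/thread_id/imageurls.txt.
import os
import sqlite3

QUERY = """
SELECT threads.board, threads.thread_id, posts.image_url
FROM posts
INNER JOIN threads ON threads.id = posts.threads_id
WHERE posts.image_url IS NOT NULL
ORDER BY threads.board, threads.thread_id, posts.chan_id;
"""

conn = sqlite3.connect("4archive.sqlite")
for board, thread_id, image_url in conn.execute(QUERY):
    folder = os.path.join(board, str(thread_id))
    os.makedirs(folder, exist_ok=True)              # e.g. b/21451323/
    with open(os.path.join(folder, "imageurls.txt"), "a") as urls:
        urls.write(image_url + "\n")
conn.close()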

Downloading the Images

The next step is to actually download the images; there is a massive amount of data to scrape.

One major hurdle is that some posts' accompanying images could not be archived for whatever reason. Instead of a NULL value, their image_url holds a placeholder such as /images/images-404.png or /images/thread-404.png. We use the regex /images/\w+-404 to match any occurrence of these placeholders in the database and ignore them.

import re

# ignore 404 placeholder urls
regex404 = re.compile(r"/images/\w+-404")
if regex404.match(image_url):
	continue
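
From there, one way to drive wget over the per-thread lists is to walk the folder tree and hand each imageurls.txt to wget. This is only a sketch of the idea, assuming wget is installed and the folders were generated as above:

# Walk the board/thread folders and let wget fetch each URL list into its
# own folder. --no-clobber makes re-runs skip finished files; --wait is
# politeness toward Imgur/Imageshack.
import os
import subprocess

for root, dirs, files in os.walk("."):
    if "imageurls.txt" in files:
        subprocess.call([
            "wget",
            "--no-clobber",
            "--wait=1",
            "-P", root,                              # save into this thread's folder
            "-i", os.path.join(root, "imageurls.txt"),
        ])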

Displaying the Threads

The next step was to export all the threads from this database into chan.zip format, which can then be hosted on Github Pages.

JSON Templating

This meant that we finally needed to implement the JSON templating system, using Jinja2 or a similar engine.
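
As a rough idea of what that looks like, here is a minimal Jinja2 sketch; the thread dict and field names (posts, chan_id, body, and so on) are illustrative placeholders rather than the final chan.zip schema.

# Minimal Jinja2 sketch: render one thread (loaded from its chan.zip JSON)
# to HTML. Field names here are placeholders, not a fixed schema.
from jinja2 import Template

THREAD_TEMPLATE = Template("""
<h1>/{{ thread.board }}/ - Thread {{ thread.thread_id }}</h1>
{% for post in thread.posts %}
<div class="post" id="p{{ post.chan_id }}">
  {% if post.image_url %}<a href="{{ post.image_url }}">{{ post.image_name }}</a>{% endif %}
  <blockquote>{{ post.body }}</blockquote>
</div>
{% endfor %}
""")

def render_thread(thread):
    return THREAD_TEMPLATE.render(thread=thread)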
