How to archive and deploy a site using HTTrack

Monday, September 20, 2021

Scraping tools

The scraping tool used here is HTTrack, a command-line utility for crawling websites and downloading their content. It is typically used to mirror a website, creating a copy of it on your local machine. To install HTTrack, visit its download page and follow the instructions for your OS.

Deployment

For high-speed archival, the tool can be deployed on an EC2 instance with large storage capacity and an IAM role granting S3 access. The data can be stored in a specific region as required by data-privacy laws. On Ubuntu on AWS, you can install it by running the following command in the terminal:

$ sudo apt-get install httrack

Then run the command:

$ httrack

The interactive wizard then prompts for the URL to be mirrored. Enter the URL and any additional configuration to start the archival process, and select the local directory where the copy should be stored. This folder will later be synced to AWS S3.
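The same mirror can also be run non-interactively. A minimal sketch, assuming a hypothetical site URL and output directory; the `-O` flag and the `"+"` domain filter are standard HTTrack options:

```shell
#!/bin/sh
# Hypothetical values -- replace with your own site and paths.
SITE_URL="https://example.com"
OUTPUT_DIR="$HOME/archives/example.com"

# -O sets the mirror directory, the "+" filter keeps the crawl on the
# site's own domain, and -v prints verbose progress.
set -- httrack "$SITE_URL" -O "$OUTPUT_DIR" "+*.example.com/*" -v
echo "mirror command: $*"

# Set RUN_MIRROR=1 to actually start the crawl (requires httrack installed).
if [ "${RUN_MIRROR:-0}" = "1" ]; then
    "$@"
fi
```

Running the mirror non-interactively like this makes it easy to script the whole archival as a single job on the EC2 instance.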

Data storage

The copied websites are stored temporarily on the instance until the mirroring process completes. The folder is then synced to an S3 bucket, using an access policy scoped to that specific bucket. Once the sync is complete, all archival data is stored durably in S3, and after the old site is shut down, the EC2 instance can be terminated.
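The sync step can be sketched as follows, assuming a hypothetical mirror directory and bucket name. `aws s3 sync` uploads only new or changed files, and `--storage-class` selects the storage class at upload time:

```shell
#!/bin/sh
# Hypothetical mirror directory and bucket -- replace with your own.
MIRROR_DIR="$HOME/archives/example.com"
BUCKET="s3://example-archive-bucket"

# The sync only runs when the AWS CLI is present and credentials resolve;
# otherwise the command is just printed.
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
    aws s3 sync "$MIRROR_DIR" "$BUCKET" --storage-class STANDARD_IA
else
    echo "would run: aws s3 sync $MIRROR_DIR $BUCKET --storage-class STANDARD_IA"
fi
```

The EC2 instance needs an IAM role permitting `s3:PutObject` (and `s3:ListBucket`) on the target bucket for this to succeed.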

There are a few storage options for archival data:

S3 Standard: general-purpose storage with 99.999999999% (11 nines) durability. This is the default option.

  • Low latency and high throughput performance
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Resilient against events that impact an entire Availability Zone
  • Designed for 99.99% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Supports SSL for data in transit and encryption of data at rest
  • S3 Lifecycle management for automatic migration of objects to other S3 Storage Classes

S3 Intelligent-Tiering: This is the only cloud storage class that delivers automatic storage cost savings when data access patterns change, without performance impact or operational overhead.

  • Stores objects in two access tiers, optimized for frequent and infrequent access
  • Frequent and Infrequent Access tiers have the same low latency and high throughput performance of S3 Standard
  • Activate optional automatic asynchronous archive capabilities for objects that become rarely accessed
  • The Archive Access and Deep Archive Access tiers have the same performance as S3 Glacier and S3 Glacier Deep Archive
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Designed for 99.9% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Small monthly monitoring and auto-tiering charge
  • No operational overhead, no retrieval charges, and no additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class
  • No minimum storage duration

S3 Standard-Infrequent Access (S3 Standard-IA): for data that is accessed less frequently but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a lower per-GB storage price and a per-GB retrieval charge.

  • Same low latency and high throughput performance of S3 Standard
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Resilient against events that impact an entire Availability Zone
  • Designed for 99.9% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Supports SSL for data in transit and encryption of data at rest
  • S3 Lifecycle management for automatic migration of objects to other S3 Storage Classes

S3 Glacier: This is a secure, durable, and low-cost storage class for data archiving.

  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Data is resilient in the event of one entire Availability Zone destruction
  • Supports SSL for data in transit and encryption of data at rest
  • Low-cost design is ideal for long-term archive
  • Configurable retrieval times, from minutes to hours
  • S3 PUT API for direct uploads to S3 Glacier and S3 Lifecycle management for automatic migration of objects

Choose the option that best fits the organisation's needs.
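The lifecycle management mentioned above can automate migration between these classes. A sketch, assuming a hypothetical bucket and an illustrative 30-day transition to S3 Glacier:

```shell
#!/bin/sh
# Hypothetical lifecycle rule: transition every object to S3 Glacier
# after 30 days. Rule ID, prefix, and day count are illustrative.
cat > /tmp/lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
    }
  ]
}
EOF

# The rule is applied only when the AWS CLI and credentials are available.
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
    aws s3api put-bucket-lifecycle-configuration \
        --bucket example-archive-bucket \
        --lifecycle-configuration file:///tmp/lifecycle.json
else
    echo "lifecycle rule written to /tmp/lifecycle.json"
fi
```

This lets the archive start in S3 Standard for the initial verification period and move to cheaper storage automatically.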

Organization and accessibility

To organise and expose the archive after migration, consider:

  • S3 Static Site Hosting with Route 53: serve the archived site directly from the bucket under a custom domain
  • S3 Bucket Access Permissions: control who can read the archived objects
  • AWS IAM Roles: grant the EC2 instance and administrators only the access they need
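For static site hosting, the bucket needs a policy allowing public reads of the archived objects. A minimal sketch, assuming a hypothetical bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-archive-bucket/*"
    }
  ]
}
```

With this in place, enabling static website hosting on the bucket and pointing a Route 53 alias record at the bucket's website endpoint serves the archive under your own domain.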
