How to archive and deploy a site using HTTrack

Monday, September 20, 2021

Scraping tools

The scraping tool used here is HTTrack, a command-line utility for crawling websites and downloading their content. It is typically used to mirror a website, creating a copy of it on your local machine. To install HTTrack, visit its download page and follow the instructions for your OS.

Deployment

For high-speed archival, the tool can be deployed on an EC2 instance with large storage capacity and an IAM role granting S3 access. The data can be stored in a specific region as required by data-privacy laws. On Ubuntu on AWS, you can install it by running the following command in the terminal:

$ sudo apt-get install httrack

Then run the command:

$ httrack

The interactive wizard then prompts for the URL to be mirrored. Enter the URL and any additional configuration to start the archival process, and select the local directory where the copy should be stored. This folder will later be synced to AWS S3.
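The same mirror can also be run non-interactively. A minimal sketch, assuming a hypothetical site URL and output directory; the `-O` flag and the `"+"` domain filter are standard HTTrack options:

```shell
#!/bin/sh
# Hypothetical values -- replace with your own site and paths.
SITE_URL="https://example.com"
OUTPUT_DIR="$HOME/archives/example.com"

# -O sets the mirror directory, the "+" filter keeps the crawl on the
# site's own domain, and -v prints verbose progress.
set -- httrack "$SITE_URL" -O "$OUTPUT_DIR" "+*.example.com/*" -v
echo "mirror command: $*"

# Set RUN_MIRROR=1 to actually start the crawl (requires httrack installed).
if [ "${RUN_MIRROR:-0}" = "1" ]; then
    "$@"
fi
```

Running the mirror non-interactively like this makes it easy to script the whole archival as a single job on the EC2 instance.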

Data storage

The copied websites are stored temporarily on the instance until the mirroring process completes. The folder is then synced to an S3 bucket, using an access policy scoped to that specific bucket. Once the sync is complete, all archival data is stored durably in S3, and after the old site is shut down, the EC2 instance can be terminated.
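The sync step can be sketched as follows, assuming a hypothetical mirror directory and bucket name. `aws s3 sync` uploads only new or changed files, and `--storage-class` selects the storage class at upload time:

```shell
#!/bin/sh
# Hypothetical mirror directory and bucket -- replace with your own.
MIRROR_DIR="$HOME/archives/example.com"
BUCKET="s3://example-archive-bucket"

# The sync only runs when the AWS CLI is present and credentials resolve;
# otherwise the command is just printed.
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
    aws s3 sync "$MIRROR_DIR" "$BUCKET" --storage-class STANDARD_IA
else
    echo "would run: aws s3 sync $MIRROR_DIR $BUCKET --storage-class STANDARD_IA"
fi
```

The EC2 instance needs an IAM role permitting `s3:PutObject` (and `s3:ListBucket`) on the target bucket for this to succeed.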

There are a few storage options for archival data:

S3 Standard: general-purpose storage with 99.999999999% (11 nines) durability. This is the default option.

  • Low latency and high throughput performance
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Resilient against events that impact an entire Availability Zone
  • Designed for 99.99% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Supports SSL for data in transit and encryption of data at rest
  • S3 Lifecycle management for automatic migration of objects to other S3 Storage Classes

S3 Intelligent-Tiering: This is the only cloud storage class that delivers automatic storage cost savings when data access patterns change, without performance impact or operational overhead.

  • Stores objects in two access tiers, optimized for frequent and infrequent access
  • Frequent and Infrequent Access tiers have the same low latency and high throughput performance of S3 Standard
  • Activate optional automatic asynchronous archive capabilities for objects that become rarely accessed
  • The Archive Access and Deep Archive Access tiers have the same performance as S3 Glacier and S3 Glacier Deep Archive
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Designed for 99.9% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Small monthly monitoring and auto-tiering charge
  • No operational overhead, no retrieval charges, and no additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class
  • No minimum storage duration

S3 Standard-Infrequent Access (S3 Standard-IA): for data that is accessed less frequently but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a lower per-GB storage price and a per-GB retrieval charge.

  • Same low latency and high throughput performance of S3 Standard
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Resilient against events that impact an entire Availability Zone
  • Designed for 99.9% availability over a given year
  • Backed with the Amazon S3 Service Level Agreement for availability
  • Supports SSL for data in transit and encryption of data at rest
  • S3 Lifecycle management for automatic migration of objects to other S3 Storage Classes

S3 Glacier: This is a secure, durable, and low-cost storage class for data archiving.

  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Data is resilient in the event of one entire Availability Zone destruction
  • Supports SSL for data in transit and encryption of data at rest
  • Low-cost design is ideal for long-term archive
  • Configurable retrieval times, from minutes to hours
  • S3 PUT API for direct uploads to S3 Glacier and S3 Lifecycle management for automatic migration of objects

Choose the option that best fits the organisation's needs.
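The lifecycle management mentioned above can automate migration between these classes. A sketch, assuming a hypothetical bucket and an illustrative 30-day transition to S3 Glacier:

```shell
#!/bin/sh
# Hypothetical lifecycle rule: transition every object to S3 Glacier
# after 30 days. Rule ID, prefix, and day count are illustrative.
cat > /tmp/lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
    }
  ]
}
EOF

# The rule is applied only when the AWS CLI and credentials are available.
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
    aws s3api put-bucket-lifecycle-configuration \
        --bucket example-archive-bucket \
        --lifecycle-configuration file:///tmp/lifecycle.json
else
    echo "lifecycle rule written to /tmp/lifecycle.json"
fi
```

This lets the archive start in S3 Standard for the initial verification period and move to cheaper storage automatically.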

Organization and accessibility

To organise and expose the archive after migration, consider:

  • S3 Static Site Hosting with Route 53: serve the archived site directly from the bucket under a custom domain
  • S3 Bucket Access Permissions: control who can read the archived objects
  • AWS IAM Roles: grant the EC2 instance and administrators only the access they need
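For static site hosting, the bucket needs a policy allowing public reads of the archived objects. A minimal sketch, assuming a hypothetical bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-archive-bucket/*"
    }
  ]
}
```

With this in place, enabling static website hosting on the bucket and pointing a Route 53 alias record at the bucket's website endpoint serves the archive under your own domain.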
