Deploy and use JuiceFS to store data on Amazon AWS

JuiceFS
Oct 20, 2021

JuiceFS is an open source distributed file system for enterprise use. It uses object storage and a database as its storage layer, and supports almost all object storage services as well as databases such as Redis, MySQL, PostgreSQL, and TiKV. Any file written to JuiceFS is split into data blocks according to specific rules and stored in the object storage, while the corresponding metadata is stored in a separate database. There are no geographical or platform restrictions: any server with access to the object storage and the database can mount and use the storage via the JuiceFS client.

JuiceFS provides a variety of access interfaces, including POSIX, Java SDK, CSI Driver, and S3 Gateway. Standard operating systems, the Hadoop ecosystem, the Kubernetes container platform, and web applications can all use JuiceFS to persist data. Simply put, JuiceFS connects massive cloud storage to the local machine, providing nearly unlimited storage space. For systems and applications, using JuiceFS storage works much like using a local disk.

Requirements

Amazon AWS is the world’s leading cloud computing platform, offering almost all types of cloud computing services. Thanks to the rich product line of AWS, users can choose JuiceFS components in a very flexible way.

Architecturally, JuiceFS consists of the following three components:

  1. a JuiceFS client installed on the server
  2. the object storage used to store data
  3. a database for storing metadata

1. Servers

Amazon EC2 is one of the most basic and widely used cloud services on the AWS platform. At the time of writing, it offers more than 400 instance sizes across 81 availability zones in 25 regions around the world, giving users the flexibility to choose and adjust the configuration of EC2 instances according to their actual needs.

New users do not need to think much about JuiceFS's own hardware requirements: even the smallest EC2 instance can mount and use JuiceFS storage. Usually, you only need to consider the hardware requirements of your business system.

By default, the JuiceFS client uses 1GB of disk space as cache. When dealing with a large number of files, the client caches the data on disk first and then uploads it to the object storage asynchronously. Choosing a disk with higher I/O performance and reserving a larger cache will give JuiceFS better performance.

2. Object Storage

Amazon S3 is the de facto standard for public cloud object storage services, and the object storage services provided by other major cloud platforms are usually compatible with the S3 API, which allows programs developed for S3 to freely switch between object storage services of other platforms.

JuiceFS fully supports Amazon S3 and all S3-compatible object storage services. For all storage types supported by JuiceFS, see the documentation (setup_object_storage.md).

Amazon S3 offers a range of storage classes suitable for different use cases. The main ones are:

  • Amazon S3 STANDARD: general-purpose storage for frequently accessed data
  • Amazon S3 STANDARD_IA: for data that is needed for a long time but accessed less frequently
  • S3 Glacier: for data that is archived over time

The STANDARD class should usually be used for JuiceFS: the other classes have lower storage prices but incur additional costs when data is retrieved.

In addition, access to the object storage service requires authentication with an access key and a secret key. To create them, refer to the document Controlling Access to Storage Buckets with User Policies (userguide/walkthrough1.html). When accessing S3 from an EC2 cloud server, you can also assign an IAM role to the EC2 instance so that the S3 API can be called without keys.

3. Database

The ability of multiple hosts to access both data and metadata is key to a distributed file system. For the metadata generated by JuiceFS to be reachable over the network in the same way as the data in S3, the database that stores the metadata should also be a network-accessible database.

Amazon RDS and ElastiCache are two cloud database services provided by AWS, both of which can be used directly for metadata storage in JuiceFS. Amazon RDS is a relational database service that supports engines such as MySQL, MariaDB, and PostgreSQL. ElastiCache is an in-memory caching cluster service with two engines, of which the Redis engine is suitable for JuiceFS.

In addition, you can also build your own database on an EC2 cloud server for JuiceFS to store metadata.

4. Cautions

  • JuiceFS is non-intrusive: it will not affect the normal operation of existing systems.
  • When selecting cloud services, it is recommended to place all of them in the same region. This is equivalent to having all services on the same intranet, with the lowest latency and the fastest access between them. In addition, according to AWS billing rules, transferring data between basic cloud services in the same region is free. Conversely, if you select cloud services in different regions, for example EC2 in ap-east-1, ElastiCache in ap-southeast-1, and S3 in us-east-2, access between the services will incur traffic charges.
  • JuiceFS does not require the object storage and the database to come from the same cloud platform; you can mix and match cloud services from different platforms as needed. For example, you can run the JuiceFS client on EC2 with Alibaba Cloud's Redis database and Backblaze B2 object storage. Of course, JuiceFS storage composed of cloud services on the same platform and in the same region will perform better.

Deployment and Usage

Next, we briefly describe how to install and use JuiceFS, using as an example an EC2 cloud server, S3 object storage, and an ElastiCache cluster with the Redis engine, all in the same region.

1. Install the client

Here we are using a Linux system on the x86-64 (amd64) architecture. Execute the following commands to download the latest version of the JuiceFS client.

$ JFS_LATEST_TAG=$(curl -s https://api.github.com/repos/juicedata/juicefs/releases/latest | grep 'tag_name' | cut -d '"' -f 4 | tr -d 'v')
$ wget "https://github.com/juicedata/juicefs/releases/download/v${JFS_LATEST_TAG}/juicefs-${JFS_LATEST_TAG}-linux-amd64.tar.gz"
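
To see what the tag-extraction pipeline above does, the following sketch runs it on a single sample line instead of the live API response. The sample line is an assumption about the shape of the JSON returned by GitHub, and parse_tag_strict is a hypothetical variant that strips only a leading "v", since tr -d 'v' removes every "v" in the tag.

```shell
# Sample line, assumed to match the shape of the GitHub API JSON response.
sample='  "tag_name": "v0.17.0",'

# Same pipeline as in the download command: keep the tag_name line, take the
# 4th double-quote-delimited field, then delete every "v" character.
parse_tag() {
    grep 'tag_name' | cut -d '"' -f 4 | tr -d 'v'
}

# Hypothetical variant: strip only a single leading "v".
parse_tag_strict() {
    grep 'tag_name' | cut -d '"' -f 4 | sed 's/^v//'
}

printf '%s\n' "$sample" | parse_tag          # prints 0.17.0
printf '%s\n' "$sample" | parse_tag_strict   # prints 0.17.0
```

For a tag such as v0.17.0 both variants give the same result; they differ only if the version string itself contains a "v".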

After downloading, extract the archive into the juice folder.

$ mkdir juice && tar -zxvf "juicefs-${JFS_LATEST_TAG}-linux-amd64.tar.gz" -C juice

Install the JuiceFS client into a directory on the system PATH, e.g., /usr/local/bin.

$ sudo install juice/juicefs /usr/local/bin

Execute the juicefs command. If it returns the help message below, the client was installed successfully.

$ juicefs
NAME:
   juicefs - A POSIX file system built on Redis and object storage.

USAGE:
   juicefs [global options] command [command options] [arguments...]

VERSION:
   0.17.0 (2021-09-24T04:17:26Z e115dc4)

COMMANDS:
   format   format a volume
   mount    mount a volume
   umount   unmount a volume
   gateway  S3-compatible gateway
   sync     sync between two storage
   rmr      remove directories recursively
   info     show internal information for paths or inodes
   bench    run benchmark to read/write/stat big/small files
   gc       collect any leaked objects
   fsck     Check consistency of file system
   profile  analyze access log
   stats    show runtime statistics
   status   show status of JuiceFS
   warmup   build cache for target directories/files
   dump     dump metadata into a JSON file
   load     load metadata from a previously dumped JSON file
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --verbose, --debug, -v  enable debug log (default: false)
   --quiet, -q             only warning and errors (default: false)
   --trace                 enable trace log (default: false)
   --no-agent              Disable pprof (:6060) and gops (:6070) agent (default: false)
   --help, -h              show help (default: false)
   --version, -V           print only the version (default: false)

COPYRIGHT:
   AGPLv3

Hint: If you execute the juicefs command and the terminal returns command not found, the /usr/local/bin directory is probably not in the system's PATH. You can use the echo $PATH command to check which directories are on the PATH and reinstall the client to one of them, or add /usr/local/bin to the PATH.

JuiceFS has good cross-platform compatibility and is supported on Linux, Windows, and macOS. If you need to know how to install it on other systems, please check the official documentation.
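
The PATH check described in the hint can be scripted. The following is a minimal sketch; path_contains is a hypothetical helper name, not part of JuiceFS, and the export only affects the current shell session.

```shell
# Return 0 if directory $1 is one of the colon-separated entries of $2
# (defaults to the current PATH when $2 is not given).
path_contains() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains /usr/local/bin; then
    echo "/usr/local/bin is already in PATH"
else
    # Takes effect for the current shell only; add the same line to a
    # shell startup file to make it permanent.
    export PATH="$PATH:/usr/local/bin"
fi
```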

3. Create File System

The format subcommand of the JuiceFS client is used to create (format) the file system. Here we use S3 as the data store and ElastiCache as the metadata store. With the client installed on EC2, create the JuiceFS file system with a command of the following form.

$ juicefs format \
--storage s3 \
--bucket https://<bucket>.s3.<region>.amazonaws.com \
--access-key <access-key-id> \
--secret-key <access-key-secret> \
redis://[<redis-username>]:<redis-password>@<redis-url>:6379/1 \
mystor

Option Description:

  • --storage: Specify the type of object storage, here we use S3. For other object storage, please refer to the JuiceFS Supported Object Storage and Setup Guide.
  • --bucket: Bucket domain for object storage.
  • --access-key and --secret-key: The secret key pair to access the S3 API.

Redis 6.0 and above support authentication with both a username and a password, and the address format is redis://username:password@redis-server-url:6379/1. For Redis 4.0 and 5.0, authentication requires only the password, and the username must be left blank when setting the Redis server address. For example: redis://:password@redis-server-url:6379/1

When an IAM role is bound to the EC2 instance, you only need to specify the --storage and --bucket options and do not need to provide the API access key. It is also possible to grant the IAM role access to ElastiCache; in that case you can simply enter the Redis URL without Redis authentication information, and the command can be rewritten as:
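
The bucket and Redis address formats described above can be assembled with two small helpers. This is only an illustrative sketch: bucket_endpoint and redis_meta_url are hypothetical names, the host redis.example.com and the password are placeholders, and no URL-encoding of special characters is performed.

```shell
# Print the S3 bucket endpoint in the form passed to --bucket.
# usage: bucket_endpoint <bucket> <region>
bucket_endpoint() {
    printf 'https://%s.s3.%s.amazonaws.com\n' "$1" "$2"
}

# Print the Redis metadata address. An empty username yields the
# password-only form redis://:password@host:6379/1.
# usage: redis_meta_url <username> <password> <host> [port] [db]
redis_meta_url() {
    printf 'redis://%s:%s@%s:%s/%s\n' "$1" "$2" "$3" "${4:-6379}" "${5:-1}"
}

bucket_endpoint herald-demo ap-southeast-1
# prints https://herald-demo.s3.ap-southeast-1.amazonaws.com
redis_meta_url "" mypassword redis.example.com
# prints redis://:mypassword@redis.example.com:6379/1
redis_meta_url admin mypassword redis.example.com
# prints redis://admin:mypassword@redis.example.com:6379/1
```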

$ juicefs format \
--storage s3 \
--bucket https://herald-demo.s3.<region>.amazonaws.com \
redis://herald-demo.abcdefg.0001.apse1.cache.amazonaws.com:6379/1 \
mystor

Seeing output like the following means that the file system was created successfully.

2021/10/14 08:38:32.211044 juicefs[10391] <INFO>: Meta address: redis://herald-demo.abcdefg.0001.apse1.cache.amazonaws.com:6379/1
2021/10/14 08:38:32.216566 juicefs[10391] <INFO>: Ping redis: 383.789µs
2021/10/14 08:38:32.216915 juicefs[10391] <INFO>: Data use s3://herald-demo/mystor/
2021/10/14 08:38:32.412112 juicefs[10391] <INFO>: Volume is formatted as {Name:mystor UUID:21a2cafd-f5d8-4a76-ae4d-482c8e2d408d Storage:s3 Bucket:https://herald-demo.s3.ap-southeast-1.amazonaws.com AccessKey: SecretKey: BlockSize:4096 Compression:none Shards:0 Partitions:0 Capacity:0 Inodes:0 EncryptKey:}

4. Mount the file system

Creating the file system stores the object storage information, including the API keys, in the database, so you do not need to provide the bucket domain or the secret key of the object storage when mounting.

Use the mount subcommand of the JuiceFS client to mount the file system to the /mnt/jfs directory.

$ sudo juicefs mount -d redis://[<redis-username>]:<redis-password>@<redis-url>:6379/1  /mnt/jfs

Note: When mounting the file system, only the database address is required, not the file system name. The default cache path is /var/jfsCache; please make sure the current user has sufficient read/write permissions on it.

You can tune JuiceFS by adjusting the mount options, for example by using --cache-size to change the cache to 20GB.

$ sudo juicefs mount --cache-size 20480 -d redis://herald-demo.abcdefg.0001.apse1.cache.amazonaws.com:6379/1  /mnt/jfs

Seeing output like the following means the file system was mounted successfully.

2021/10/14 08:47:49.623814 juicefs[10601] <INFO>: Meta address: redis://herald-demo.abcdefg.0001.apse1.cache.amazonaws.com:6379/1
2021/10/14 08:47:49.628157 juicefs[10601] <INFO>: Ping redis: 426.127µs
2021/10/14 08:47:49.628941 juicefs[10601] <INFO>: Data use s3://herald-demo/mystor/
2021/10/14 08:47:49.629198 juicefs[10601] <INFO>: Disk cache (/var/jfsCache/21a2cafd-f5d8-4a76-ae4d-482c8e2d408d/): capacity (20480 MB), free ratio (10%), max pending pages (15)
2021/10/14 08:47:50.132003 juicefs[10601] <INFO>: OK, mystor is ready at /mnt/jfs

Using the df command, you can see how the filesystem is mounted.

$ df -Th
Filesystem      Type          Size  Used Avail Use% Mounted on
JuiceFS:mystor  fuse.juicefs  1.0P   64K  1.0P   1% /mnt/jfs

Once mounted, the file system can be used like a local disk. Data stored in the /mnt/jfs directory is handled by the JuiceFS client and eventually stored in the S3 object store.

Multi-Host Sharing: JuiceFS can be mounted by multiple hosts at the same time. You can install the JuiceFS client on any cloud server on any platform and mount the file system with the same database address, e.g. redis://:<your-redis-password>@herald-sh-abc.redis.rds.aliyuncs.com:6379/1, to share it. You need to make sure that each host on which the file system is mounted has proper access to the database and to the S3 bucket used with it.
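
To confirm from a script that the mount is in place, one option is to look for the mount point in the kernel's mount table. The sketch below assumes a Linux host where /proc/mounts is available and that the file system type is reported as fuse.juicefs, as in the df output above; is_jfs_mounted is a hypothetical helper name.

```shell
# Return 0 if a JuiceFS file system is mounted at $1.
# $2 is the mount table to read (defaults to /proc/mounts on Linux).
is_jfs_mounted() {
    awk -v mp="$1" '
        $2 == mp && $3 == "fuse.juicefs" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

if is_jfs_mounted /mnt/jfs; then
    echo "JuiceFS is mounted at /mnt/jfs"
else
    echo "JuiceFS is not mounted at /mnt/jfs"
fi
```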

5. Unmount JuiceFS Storage

The file system can be unmounted using the umount command provided by the JuiceFS client, e.g.

$ sudo juicefs umount /mnt/jfs

Note: Forcibly unmounting a file system that is in use may result in data corruption or loss, so please proceed with caution. For more information, please refer to the official documentation.

6. Auto-mount on boot

If you don’t want to re-mount JuiceFS storage manually every time you reboot your system, you can set up an automatic mount.

First, you need to rename the juicefs client to mount.juicefs and copy it to the /sbin/ directory.

$ sudo cp juice/juicefs /sbin/mount.juicefs

Edit the /etc/fstab configuration file and add a new record.

redis://[<redis-username>]:<redis-password>@<redis-url>:6379/1    /mnt/jfs       juicefs     _netdev,cache-size=20480     0  0

The mount option cache-size=20480 allocates 20GB of local disk space for the JuiceFS cache. Choose the cache size based on your actual EBS disk capacity.

You can adjust the FUSE mount options in the above configuration as needed. For more details, please check the documentation.
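
One way to pick a cache size is to derive it from the free space on the disk that holds the cache directory. The sketch below is a rough heuristic, not a JuiceFS recommendation: suggest_cache_mib is a hypothetical helper, the 50% default is an arbitrary assumption, and /var is used because the default cache path is /var/jfsCache.

```shell
# Suggest a cache size in MiB as a percentage of the free space.
# usage: suggest_cache_mib <free-mib> [percent]   (percent defaults to 50)
suggest_cache_mib() {
    echo $(( $1 * ${2:-50} / 100 ))
}

# Free space in MiB on the file system holding /var. With -P, df prints one
# line per file system; with -m, sizes are reported in 1 MiB blocks.
free_mib=$(df -Pm /var | awk 'NR == 2 { print $4 }')
echo "free: ${free_mib:-0} MiB, suggested cache-size: $(suggest_cache_mib "${free_mib:-0}")"
```

For example, suggest_cache_mib 40960 prints 20480, which matches the 20GB value used in this article on a disk with 40GB free.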

Note: Please replace the Redis address, mount point, and mount options in the above configuration file with your actual information.
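
As a small illustration of filling in the record, the following sketch prints an fstab line from its three variable parts. jfs_fstab_entry is a hypothetical helper, and the Redis address shown is a placeholder.

```shell
# Print an /etc/fstab record in the format shown above.
# usage: jfs_fstab_entry <meta-url> <mount-point> <cache-mib>
jfs_fstab_entry() {
    printf '%s  %s  juicefs  _netdev,cache-size=%s  0  0\n' "$1" "$2" "$3"
}

jfs_fstab_entry 'redis://:mypassword@redis.example.com:6379/1' /mnt/jfs 20480
# prints:
# redis://:mypassword@redis.example.com:6379/1  /mnt/jfs  juicefs  _netdev,cache-size=20480  0  0
```

After reviewing the printed line and adding it to /etc/fstab, running sudo mount -a is a common way to check that the entry mounts correctly before relying on it at the next reboot.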

Summary

This article introduces the deployment and usage of JuiceFS on AWS, from architecture to day-to-day use. It can serve as a reference for users who need to expand storage space for applications in the cloud, or who need elastic storage for data backup, archiving, and disaster recovery.

In addition to the use on standard operating systems introduced in this article, JuiceFS can also be used with the Hadoop big data ecosystem and the Kubernetes container orchestration platform, which will be covered in subsequent articles.

JuiceFS (https://github.com/juicedata/juicefs) is a distributed POSIX file system built on top of Redis and S3.