Computation on nodes
A MooseFS instance can share hardware with other software. Since chunkservers are used for data storage and require minimal computational power, it makes sense to use these machines for other purposes as well, for example data processing (image analysis, rendering and animation, AI training, scientific research, and similar).
Storage policies for data
The storage policy for data in such an instance will depend on what kind of data is stored: is it temporary or permanent? Does the computation produce a lot of changes, or is the data mainly read and analysed?
For data that is stored once and then only read, EC format is a great solution, as it saves a lot of space. For example, this class:
mfsscadmin create -K 3* -A@8+2 -o f my_permanent_data
will store data in 3 copies when it's initially written, and then convert it to EC format as soon as possible, still with the same redundancy level, but much lower overhead (25% instead of 200%).
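Once the class exists, you assign it to the relevant files or directories with mfssetsclass; a minimal sketch, assuming the share is mounted under /mnt/mfs and using an illustrative directory name:
mfssetsclass -r my_permanent_data /mnt/mfs/datasets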
For data that is temporary and becomes useless if, for example, a computation is interrupted by an error, it is best to use copy format with low redundancy (2 copies or even a single copy):
mfsscadmin create -K 2* my_temporary_data
or
mfsscadmin create -K * my_very_temporary_data
For data that is modified by computation, but whose final result is archived and kept for further reference, the recommended approach is a storage class that initially keeps the data in copy format and converts it to EC format only after a longer time has passed since the last modification:
mfsscadmin create -K 3* -A@8+2 -o c -d 1d my_research_data
The switches -o c -d 1d mean that the data should be converted to the archive state (for which EC format is defined in this class) after 1 day (1d) has passed since the last change time (c stands for ctime). You can modify this to suit your needs: make the delay shorter (the minimum is 1 hour) or longer, or use mtime or atime instead of ctime.
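For instance, a hypothetical variant of the same class that archives data one week after its last data modification (mtime) could look like this (the class name is illustrative, and the -o m and -d 7d values assume the same option syntax as above):
mfsscadmin create -K 3* -A@8+2 -o m -d 7d my_research_data_weekly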
Access to data - topology
Chunkservers that are being used as computational nodes will have the client (mount) process running on them to provide data to whatever computational task needs it. There are some strategies that can help make this access even more efficient.
MooseFS has a mechanism called topology. You can define your network's topology in the configuration file mfstopology.cfg to indicate the "distance" between the machines in your network (e.g. which ones are connected to the same switch, which ones are connected to different switches in the same rack, which ones are in different racks, etc.). The topology is then used by the Master and the Client to optimise I/O (a configuration sketch follows this list):
- Master will sort the copies of chunks needed by a client process for an I/O operation according to their topological distance from that client, which means the copies "closest" to the client will be at the beginning of the list. Thanks to this mechanism the client can always read from the copy with the shortest network distance (and thus, hopefully, the lowest latency); the write operation is also optimised, because the copies that need to be modified are sorted by distance.
- When topology is defined, you can also set the CREATIONS_RESPECT_TOPOLOGY variable in the mfsmaster.cfg config file to a value greater than 0. When the Master needs to create a new chunk, it will then try to create the copies of that chunk on chunkservers whose topological distance from the client is lower than the defined value; this means a client writing new data (and thus creating new chunks) will do so in an optimal way whenever possible (i.e. whenever nearby chunkservers have space and are not too busy with other tasks).
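As a rough sketch, assuming two subnets sitting behind two different switches (the addresses, switch numbers and the chosen threshold are purely illustrative; consult the mfstopology.cfg and mfsmaster.cfg man pages for the exact syntax of your version), the relevant configuration entries could look like this:
# mfstopology.cfg: map each subnet to a switch number
192.168.1.0/24    1
192.168.2.0/24    2
# mfsmaster.cfg: when creating new chunks, prefer chunkservers at a topological
# distance lower than 2 from the client (i.e. the same machine or the same switch)
CREATIONS_RESPECT_TOPOLOGY = 2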
Access to data - labels
When it comes to the read operation, there is also another way of telling MooseFS which copy the client should read from, and that is the use of labels. When you mount a MooseFS share, you can add an -o mfspreflabels=LABELEXPR option to the mfsmount command, substituting the string LABELEXPR with a list of labels (or a list of logical expressions consisting of labels, as defined by the mfsscadmin tool). A mount with this option will always try first to read a copy from a chunkserver that matches one of the labels or label expressions provided with this option.
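For example, a mount on a chunkserver labeled A could look like this (the master host name and the mount point are illustrative):
mfsmount /mnt/mfs -H mfsmaster.example.lan -o mfspreflabels=A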
Some practical examples of how this mechanism can be used:
- There are 9 chunkservers and each has one computational task to handle, but each of them works on a different set of data. You can give each of those 9 chunkservers a different label (let's say the letters A to I) and define 9 storage classes; each storage class keeps 3 copies of data: one on a server with a specific label and two on any servers:
mfsscadmin create -K A,*,* my_comp_data_a
mfsscadmin create -K B,*,* my_comp_data_b
mfsscadmin create -K C,*,* my_comp_data_c
mfsscadmin create -K D,*,* my_comp_data_d
mfsscadmin create -K E,*,* my_comp_data_e
mfsscadmin create -K F,*,* my_comp_data_f
mfsscadmin create -K G,*,* my_comp_data_g
mfsscadmin create -K H,*,* my_comp_data_h
mfsscadmin create -K I,*,* my_comp_data_i
You mount the MooseFS share with -o mfspreflabels=A on the chunkserver with label A, with -o mfspreflabels=B on the chunkserver with label B, and so on. You then set the storage class for the files and directories used by the computational task on the chunkserver with label A to my_comp_data_a, on the chunkserver with label B to my_comp_data_b, and so on. As a result, the read operation will almost always be performed from the local copy, so no unnecessary network transfer of read data is needed (a per-node configuration sketch is given after these examples).
- There are two groups of chunkservers (at least 3 servers in each for the purpose of this example); one group is assigned one computational task, the other is assigned another. You can give all chunkservers from the first group one label (let's say A) and all chunkservers from the second group another label (let's say B). You can then define 2 storage classes:
mfsscadmin create -K 3A my_data_group_a
mfsscadmin create -K 3B my_data_group_b
You mount the MooseFS share with -o mfspreflabels=A on the chunkservers in the first group (labeled A) and with -o mfspreflabels=B on the chunkservers in the second group (labeled B). Then all the files needed by the first computational task will have their chunk copies located within that group, with a higher probability that at least one copy is local (on the same chunkserver as the client reading it). If you also define a topology for your network (where it makes sense, i.e. when not all chunkservers are connected to the same switch), it will help further, as both mechanisms cooperate to give the client the best copy of a chunk to read and the best order of copies to write.
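To make the first example concrete, here is a minimal sketch of the setup on the node that should carry label A (the label is assigned with the LABELS option in mfschunkserver.cfg; the host name, mount point and directory name are illustrative; the second example is configured analogously, just with the group labels and classes):
# in mfschunkserver.cfg on that node:
LABELS = A
# mount with a preference for copies stored on A-labeled chunkservers:
mfsmount /mnt/mfs -H mfsmaster.example.lan -o mfspreflabels=A
# assign the matching storage class to the task's working directory:
mfssetsclass -r my_comp_data_a /mnt/mfs/task_a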