Production ML Systems - Tuning System Performance

Production ML Systems - Designing High-Performance ML Systems

1.

If each of your examples is large and requires parsing, and your model is relatively simple and shallow, your model is likely to be:

  • ⭕ I/O bound, so you should look for ways to store data more efficiently and ways to parallelize the reads.
    • Indicators of an I/O-bound model:
      1. The number of inputs is large.
      2. The input data is heterogeneous (requires parsing).
      3. The model is so small that its compute requirements are trivial.
      4. The input data is on a storage system with low throughput.
    • Solutions for I/O-bound training:
      1. Store the data more efficiently.
      2. Move the data to a storage system with higher throughput.
      3. Parallelize the reads.
      4. (Not ideal) reduce the batch size so that less data is read in each step.
  • CPU-bound, so you should use GPUs or TPUs.
  • Latency-bound, so you should use faster hardware
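The first solution, storing data more efficiently, is easy to underestimate. As a rough, self-contained illustration (plain Python, not TensorFlow; the two-column row format here is invented for the example), a fixed-width binary encoding of the same records is both smaller and parse-free compared with text CSV, which is the idea behind binary formats like TFRecord:

```python
import csv
import io
import struct

rows = [(i, i * 0.5) for i in range(1000)]

# Text CSV: every value must be re-parsed from a string on each read.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = buf.getvalue().encode()

# Fixed-width binary: int32 + float32 = 8 bytes per record, no parsing,
# and records can be batch-read at known offsets.
binary = b"".join(struct.pack("<if", i, x) for i, x in rows)

print(len(csv_bytes), len(binary))  # the binary encoding is smaller
```

Because every binary record has a known size (8 bytes here), a reader can also fetch large batches without scanning for delimiters.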

2.

Which of the following indicates that ML training is CPU bound?

  • ❌ If I/O is complex, but the model involves lots of complex/expensive computations.
    • Complex I/O indicates I/O-bound ML training, not CPU-bound.
  • If you are running a model on powered hardware.
  • ⭕ If I/O is simple, but the model involves lots of complex/expensive computations.
    • CPU-bound training commonly occurs with expensive computations and/or underpowered hardware.
  • If you are running a model on accelerated hardware.
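A quick way to tell the two cases apart in practice is to time the input phase and the compute phase of a step separately; whichever dominates is the bottleneck. A minimal sketch in plain Python (the two stand-in functions below are hypothetical placeholders for a real batch read and a real train step):

```python
import time

def avg_time(fn, n=20):
    """Average wall-clock seconds per call of fn."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

# Hypothetical stand-ins: substitute one real batch read and one train step.
def read_batch():
    return sum(range(10_000))                   # pretend I/O + parsing

def train_step():
    return sum(i * i for i in range(100_000))   # pretend model compute

io_t = avg_time(read_batch)
compute_t = avg_time(train_step)
print("likely I/O bound" if io_t > compute_t else "likely CPU bound")
```

With real workloads, tools like the TensorFlow Profiler give the same breakdown with much more detail, but a crude timing comparison is often enough to decide which side to optimize first.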

3.

For the fastest I/O performance in TensorFlow (check all that apply)

  • ⭕ Read TFRecords into your model. : dataset = tf.data.TFRecordDataset(...)
    • TFRecords are built for fast, efficient batch reads, without the overhead of parsing the data in Python.
  • ⭕ Read in parallel threads. : dataset = tf.data.TFRecordDataset(files, num_parallel_reads=40)
    • When dealing with a large dataset sharded across Cloud Storage, speed up by reading multiple files in parallel to increase the effective throughput.
      • Enable this with a single argument to the TFRecordDataset constructor: num_parallel_reads.
  • ⭕ Use fused operations. : shuffle_and_repeat, map_and_batch
      • map_and_batch parallelizes both the execution of the map function and the transfer of each element into the batch tensors.
  • ⭕ Prefetch the data : dataset.prefetch
      • Decouples the time data is produced from the time it is consumed: data is prefetched into a buffer in parallel with the training step, so input for the next step is ready before the current one completes.
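Putting the four answers together, one possible end-to-end tf.data pipeline might look like the sketch below. It writes a tiny TFRecord file first so the example is runnable as-is; note that in current TensorFlow the fused map_and_batch pattern is expressed as .map(..., num_parallel_calls=...) followed by .batch(...), and the feature name "x" is invented for this example:

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny TFRecord file so the pipeline is runnable end to end.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "train-00000.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(8):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(
                float_list=tf.train.FloatList(value=[float(i)])),
        }))
        writer.write(example.SerializeToString())

def parse_fn(record):
    # Parsing happens inside the tf.data graph, not in Python.
    return tf.io.parse_single_example(
        record, {"x": tf.io.FixedLenFeature([1], tf.float32)})

dataset = (
    tf.data.TFRecordDataset([path], num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel parsing
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # overlap input with the training step
)

for batch in dataset:
    print(batch["x"].shape)
```

With a sharded dataset you would pass a list of files (or a file pattern resolved via tf.data.Dataset.list_files) so that num_parallel_reads can interleave reads across shards.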

4.

What does high-performance machine learning determine?

  • ⭕ Time taken to train a model
    • One key aspect is the time taken to train a model.
  • Reliability of a model
  • Deploying a model
  • Training a model