Heap buffer overflow in `StringNGrams`
Description
TensorFlow is an end-to-end open source platform for machine learning. An attacker can cause a heap buffer overflow by passing crafted inputs to tf.raw_ops.StringNGrams. This is because the implementation(https://github.com/tensorflow/tensorflow/blob/1cdd4da14282210cc759e468d9781741ac7d01bf/tensorflow/core/kernels/string_ngrams_op.cc#L171-L185) fails to consider corner cases where input would be split in such a way that the generated tokens should only contain padding elements. If input is such that num_tokens is 0, then, for data_start_index=0 (when left padding is present), the marked line would result in reading data[-1]. The fix will be included in TensorFlow 2.5.0. We will also cherrypick this commit on TensorFlow 2.4.2, TensorFlow 2.3.3, TensorFlow 2.2.3 and TensorFlow 2.1.4, as these are also affected and still in supported range.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
tensorflowPyPI | < 2.1.4 | 2.1.4 |
tensorflowPyPI | >= 2.2.0, < 2.2.3 | 2.2.3 |
tensorflowPyPI | >= 2.3.0, < 2.3.3 | 2.3.3 |
tensorflowPyPI | >= 2.4.0, < 2.4.2 | 2.4.2 |
tensorflow-cpuPyPI | < 2.1.4 | 2.1.4 |
tensorflow-cpuPyPI | >= 2.2.0, < 2.2.3 | 2.2.3 |
tensorflow-cpuPyPI | >= 2.3.0, < 2.3.3 | 2.3.3 |
tensorflow-cpuPyPI | >= 2.4.0, < 2.4.2 | 2.4.2 |
tensorflow-gpuPyPI | < 2.1.4 | 2.1.4 |
tensorflow-gpuPyPI | >= 2.2.0, < 2.2.3 | 2.2.3 |
tensorflow-gpuPyPI | >= 2.3.0, < 2.3.3 | 2.3.3 |
tensorflow-gpuPyPI | >= 2.4.0, < 2.4.2 | 2.4.2 |
Affected products
1- Range: < 2.1.4
Patches
1ba424dd8f16fEnhance validation of ngram op and handle case of 0 tokens.
2 files changed · +75 −11
tensorflow/core/kernels/string_ngrams_op.cc+41 −11 modified@@ -61,16 +61,28 @@ class StringNGramsOp : public tensorflow::OpKernel { OP_REQUIRES_OK(context, context->input("data_splits", &splits)); const auto& splits_vec = splits->flat<SPLITS_TYPE>(); - // Validate that the splits are valid indices into data + // Validate that the splits are valid indices into data, only if there are + // splits specified. const int input_data_size = data->flat<tstring>().size(); const int splits_vec_size = splits_vec.size(); - for (int i = 0; i < splits_vec_size; ++i) { - bool valid_splits = splits_vec(i) >= 0; - valid_splits = valid_splits && (splits_vec(i) <= input_data_size); - OP_REQUIRES( - context, valid_splits, - errors::InvalidArgument("Invalid split value ", splits_vec(i), - ", must be in [0,", input_data_size, "]")); + if (splits_vec_size > 0) { + int prev_split = splits_vec(0); + OP_REQUIRES(context, prev_split == 0, + errors::InvalidArgument("First split value must be 0, got ", + prev_split)); + for (int i = 1; i < splits_vec_size; ++i) { + bool valid_splits = splits_vec(i) >= prev_split; + valid_splits = valid_splits && (splits_vec(i) <= input_data_size); + OP_REQUIRES(context, valid_splits, + errors::InvalidArgument( + "Invalid split value ", splits_vec(i), ", must be in [", + prev_split, ", ", input_data_size, "]")); + prev_split = splits_vec(i); + } + OP_REQUIRES(context, prev_split == input_data_size, + errors::InvalidArgument( + "Last split value must be data size. Expected ", + input_data_size, ", got ", prev_split)); } int num_batch_items = splits_vec.size() - 1; @@ -174,13 +186,31 @@ class StringNGramsOp : public tensorflow::OpKernel { ngram->append(left_pad_); ngram->append(separator_); } + // Only output first num_tokens - 1 pairs of data and separator for (int n = 0; n < num_tokens - 1; ++n) { ngram->append(data[data_start_index + n]); ngram->append(separator_); } - ngram->append(data[data_start_index + num_tokens - 1]); - for (int n = 0; n < right_padding; ++n) { - ngram->append(separator_); + // Handle case when there are no tokens or no right padding as these can + // result in consecutive separators. + if (num_tokens > 0) { + // If we have tokens, then output last and then pair each separator with + // the right padding that follows, to ensure ngram ends either with the + // token or with the right pad. + ngram->append(data[data_start_index + num_tokens - 1]); + for (int n = 0; n < right_padding; ++n) { + ngram->append(separator_); + ngram->append(right_pad_); + } + } else { + // If we don't have tokens, then the last item inserted into the ngram + // has been the separator from the left padding loop above. Hence, + // output right pad and separator and make sure to finish with a + // padding, not a separator. + for (int n = 0; n < right_padding - 1; ++n) { + ngram->append(right_pad_); + ngram->append(separator_); + } ngram->append(right_pad_); }
tensorflow/core/kernels/string_ngrams_op_test.cc+34 −0 modified@@ -542,6 +542,40 @@ TEST_F(NgramKernelTest, TestEmptyInput) { assert_int64_equal(expected_splits, *GetOutput(1)); } +TEST_F(NgramKernelTest, TestNoTokens) { + MakeOp("|", {3}, "L", "R", -1, false); + // Batch items are: + // 0: + // 1: "a" + AddInputFromArray<tstring>(TensorShape({1}), {"a"}); + AddInputFromArray<int64>(TensorShape({3}), {0, 0, 1}); + TF_ASSERT_OK(RunOpKernel()); + + std::vector<tstring> expected_values( + {"L|L|R", "L|R|R", // no input in first split + "L|L|a", "L|a|R", "a|R|R"}); // second split + std::vector<int64> expected_splits({0, 2, 5}); + + assert_string_equal(expected_values, *GetOutput(0)); + assert_int64_equal(expected_splits, *GetOutput(1)); +} + +TEST_F(NgramKernelTest, TestNoTokensNoPad) { + MakeOp("|", {3}, "", "", 0, false); + // Batch items are: + // 0: + // 1: "a" + AddInputFromArray<tstring>(TensorShape({1}), {"a"}); + AddInputFromArray<int64>(TensorShape({3}), {0, 0, 1}); + TF_ASSERT_OK(RunOpKernel()); + + std::vector<tstring> expected_values({}); + std::vector<int64> expected_splits({0, 0, 0}); + + assert_string_equal(expected_values, *GetOutput(0)); + assert_int64_equal(expected_splits, *GetOutput(1)); +} + TEST_F(NgramKernelTest, ShapeFn) { ShapeInferenceTestOp op("StringNGrams"); INFER_OK(op, "?;?", "[?];[?]");
Vulnerability mechanics
Generated by null/stub on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
7- github.com/advisories/GHSA-4hrh-9vmp-2jggghsaADVISORY
- nvd.nist.gov/vuln/detail/CVE-2021-29542ghsaADVISORY
- github.com/pypa/advisory-database/tree/main/vulns/tensorflow-cpu/PYSEC-2021-470.yamlghsaWEB
- github.com/pypa/advisory-database/tree/main/vulns/tensorflow-gpu/PYSEC-2021-668.yamlghsaWEB
- github.com/pypa/advisory-database/tree/main/vulns/tensorflow/PYSEC-2021-179.yamlghsaWEB
- github.com/tensorflow/tensorflow/commit/ba424dd8f16f7110eea526a8086f1a155f14f22bghsax_refsource_MISCWEB
- github.com/tensorflow/tensorflow/security/advisories/GHSA-4hrh-9vmp-2jggghsax_refsource_CONFIRMWEB
News mentions
0No linked articles in our index yet.